home *** CD-ROM | disk | FTP | other *** search
Text File | 1997-03-15 | 320.5 KB | 7,246 lines |
- ------------------------------------------------------------------------------
- Inform 6.14
-
- Technical Manual
-
- Graham Nelson
- 27th April 1996
- revised on the following dates:
- 5th May, 10th May, 6th September, 23rd September, 16th December 1996,
- 27th January 1997, 26th March, 5th April, 8th September
- ------------------------------------------------------------------------------
- 1 The source code
- 1.1 Inventory
- 1.2 Map
- 1.3 Naming conventions
- 1.4 Typedef-named types
- 2 Porting Inform to a new environment
- 2.1 Dependence on the OS
- 2.2 Portability issues: the types int32 and uchar
- 2.3 The character set and the format of text files
- 2.4 The OS definitions block in "header.h"
- 2.5 Running Inform in a multi-tasking OS
- 3 Front end
- 3.1 The ICL interpreter
- 3.2 Fundamental method
- 4 Lexical analysis
- 4.1 Aim and structure of the lexer
- 4.2 Level 1: lexical blocks and buffered input
- 4.3 Level 2: breaking text into lexemes
- 4.4 Representation of tokens
- 4.5 Level 3: identifying identifiers
- 4.6 Hash-coding and comparing names
- 4.7 The symbols table
- 4.8 Token lookahead: the circle
- 4.9 Summary of the token output
- 5 Syntax analysis 1: the top-down structural parser
- 5.1 Aim and structure of the syntax analyser
- 5.2 Predictive parsing
- 5.3 The context-free grammar
- 5.4 Assigning values to symbols
- 5.5 Outputs other than assembly language
- 5.6 Assembly operands
- 5.7 Translation to assembly language
- 5.8 Summary of assembly language instructions output
- 6 Syntax analysis 2: the bottom-up expression parser
- 6.1 Map and structure
- 6.2 The operator precedence grammar
- 6.3 Lexical level 4: tokens to etokens
- 6.4 Resolution of ambiguous tokens
- 6.5 Constant folding
- 6.6 Checking lvalues and simplifying the parse tree
- 6.7 Summary of parse tree output
- 7 Code generation from parse trees
- 7.1 Aim and structure of the code generator
- 7.2 Annotation for conditions
- 7.3 Main traversal
- 7.4 Void context
- 7.5 Results of subexpressions
- 7.6 Conditional and logical operators
- 7.7 System functions
- 8 Assembly of code, text and data structures
- 8.1 Assembly of initial code
- 8.2 Branch offset backpatching and optimisation
- 8.3 Text translation: ISO and Unicode resolution
- 8.4 Dictionary management
- 8.5 Format of tables unspecified by the Z-machine
- 8.6 Grammar version numbers GV1 and GV2
- 8.7 Value backpatching
- 8.8 Packed address decoding
- 9 Run-time veneer
- 9.1 Services provided by the veneer
- 9.2 Properties 2 and 3, "ofclass" and "metaclass"
- 9.3 Class numbering and class objects
- 9.4 Individual property identifiers
- 9.5 Individual property tables
- 9.6 Availability of symbol names at run-time
- 10 Module compilation and linking
- 10.1 Model
- 10.2 Linking a module
- 10.3 Imports and exports
- 10.4 Backpatch backpatching
- 10.5 How modules differ from story files
- 10.6 Restrictions on what modules may contain
- 11 Service levels
- 11.1 Variables and arrays
- 11.2 Memory allocation and deallocation
- 11.3 Error messages
- 11.4 File input/output
- 12 Low-level language features
- 12.1 Using the "Trace" directive
- 12.2 System constants and other secret syntaxes
- 12.3 The "Zcharacter" directive
- 12.4 Sequence points
- 12.5 Format of debugging information files (for Infix)
- 13 What little I remember
- 13.1 Publication history
- 13.2 Design history
- 13.3 Implementation history
- 13.4 Modification history
- ------------------------------------------------------------------------------
-
- 1 The source code
- ---------------
-
- 1.1 Inventory
- ---------
-
- Inform is written in portable ANSI C and the source code is divided into 21
- files of code, called "sections", plus 1 #include file of linkage, constant
- and type definitions. These files are:
-
- arrays.c asm.c bpatch.c chars.c directs.c errors.c
- expressc.c expressp.c files.c inform.c lexer.c linker.c
- memory.c objects.c states.c symbols.c syntax.c tables.c
- text.c veneer.c verbs.c header.h
-
- Note that all their names fit into the 8.3 filenaming convention. The
- subdivision into 21 sections is intended to ensure that each .c file can be
- compiled into one linkable object file: under some C compilers object
- code files cannot exceed 64K in length. On my machine, expressc.o (the
- object code derived from expressc.c) is the largest, at 40K (about a third of
- which is the static table of operator data).
-
- A concise tree giving the structure of the source code follows. A section
- name is given brackets when it has already been given on a previous line: so
- "(inform.c)" is intended to be read as "now back to inform.c again". A name
- written as a function, like "compile()", is indeed a function, whose
- arguments are omitted here. Text in square brackets indicates the presence
- of interesting tables of static data. Finally, note that this structure
- is not absolutely rigorous: the error-reporting routines in "errors.c", for
- example, are called from all over Inform, not just from the lexical analyser.
-
-
- inform.c ICL parser: switches, filename translation, path variables
-
- memory.c ICL memory command parser: sets memory settings.
-
- (inform.c) compile(): polling all sections to manage variables,
- allocate and free arrays
-
- syntax.c syntax analyser, top level
-
- lexer.c lexical analyser: converts source text to a stream of
- tokens
- [Tables of all Inform keywords]
-
- chars.c performs character set translations
-
- files.c reads source code files into buffers for the lexer;
- performs miscellaneous file I/O
-
- symbols.c keeps table of symbol names found in the source;
- recognises from and adds to this table
-
- errors.c issues error, fatal error and warning messages
-
- (syntax.c) parse_program(): top level routine in syntax analyser
-
- parse_directive()
-
- directs.c parse_given_directive(): parses and obeys the easier
- directives; manages conditional compilation; delegates
- directive parsing down to other sections for harder
- cases:
-
- text.c make_abbreviation(): Abbreviate
-
- arrays.c make_global(): Array, Global
-
- objects.c make_attribute(): Attribute
- make_property(): Property
- make_class(): Class
- make_object(): Object
-
- verbs.c make_fake_action(): Fake_Action
- make_verb(): Verb
- extend_verb(): Extend
-
- linker.c link_module(): Link
-
- (symbols.c) assign_symbol(): giving value and type to symbols
-
- (syntax.c) parse_routine()
-
- parse_code_block()
-
- states.c parse_statement(): assigns ".Label" labels, parses
- and generates code for statements
-
- parse_action(): handles <Action ...> statements
-
- parse_print(): handles print and print_ret
-
- asm.c parse_assembly(): handles assembly language
- source code (preceded by "@")
-
- expressp.c parse_expression(): parses all expressions
- (including constants), conditions and
- assignments.
-
- expressc.c [Table of operators]
- code_generate(): generates code from parse
- trees previously found by parse_expression()
-
- (asm.c) assemble_2_to() (and many other similarly
- named routines)
-
- [Database of all Z-machine opcodes]
- assemble_instruction(): assembles a single
- Z-code instruction
-
- assemble_label_no(): puts label N here
-
- assemble_routine_end(): finishes routine,
- backpatches and optimises branches
-
- (text.c) compile_string(): translates ASCII text
- to Z-encoded text
-
- the dictionary manager
-
- the abbreviations optimiser
-
- (chars.c) performs character set translations
-
- veneer.c compile_veneer(): compiles in any whole routines
- needed by the rest of the compiled code
- [Table of Inform source code for veneer routines]
-
- tables.c construct_storyfile(): glues together all the
- code, dictionary, object tree, etc. into a story file
-
- bpatch.c backpatch the code in the light of recent knowledge
-
-
- 1.2 Map
- ---
-
- Here follows a map of the Inform archipelago, marking the inhabited islands,
- their shipping lanes and chief imports and exports:
-
-
- command line
- and/or ICL files in
- |
- | ICL commands
- \|/
- +----------+ FRONT
- | inform.c | END
- | memory.c |
- +----------+
- |
- | filenames
- |
- \|/ +------------+ LEXICAL
- ------> files.c -----> | lexer.c | ANALYSER
- source chars | symbols.c |
- code in +------------+
- /|\ |
- symbol | |
- values | | tokens
- | \|/
- +------------+
- SYNTAX | syntax.c | -----+------> asm.c --->---\
- ANALYSER: | states.c | / assembly /|\ initial|
- STATEMENTS | . | / language | code|
- | ---------- | / | |
- @ ASSEMBLY | asm.c | -/ | |
- | ---------- | | |
- | . | parse trees | \|/
- EXPRESSIONS | expressp.c | --+-------> expressc.c asm.c
- | (map 6.1) | \ |
- | . | \ |
- | . | \ TEXT |
- | ---------- | strings +-------+Z-text |
- DIRECTIVES | directs.c | \---->|text.c |-->---)|(--\
- | . | |chars.c| | |
- | . | +-------+ | |
- | . | dictionary| | |
- | . | alphabets \|/ \|/ |
- | . | | | |
- | arrays.c | ------->------\ | | |
- | . | array area | | raw| |
- | . | | | Z-code| |
- | objects.c | ----->-----\ | | | |
- | verbs.c | objects | | | | |
- +------------+ | | | | |
- | \|/\|/\|/ \|/ |
- | grammar +----------+----------+ |
- \---------------> | tables.c bpatch.c | |
- +----------+----------+ |
- | OUTPUT | |
- | | |
- Z-machine \|/ Z-code| |
- up to | | |
- code area | \|/ \|/
- \-----------> files.c
- |
- |
- \|/
- story
- file out
-
- (For clarity, the linker and a few smaller tables are missed out; and the
- "service" sections of Inform, such as "errors.c" and the allocation code in
- "memory.c", are missed out since they are, so to speak, pipes and sewers
- which lie beneath the surface of the ocean.)
-
-
- 1.3 Naming conventions
- ------------------
-
- The "header.h" makes over 700 constants using #define.
-
- These are mainly in capital letters and are followed by _ and then some short
- code indicating what kind of constant is being defined: for instance,
- NUMBER_TT means "the token type <number>". We write *_TT for the set of
- constants ending in _TT. Similarly, though to a lesser extent, groups of
- related variables and routines have grouped names.
-
- -------------------------------------------------------------------------
- Set of constants Used for
- -------------------------------------------------------------------------
- *_Extension File extensions, such as ".z5", used if the
- host OS supports them
- *_Directory Initial values for the ICL path variables
- (e.g., default pathname where story files are
- written to)
-
- *_TT Token types
-
- *_CODE Token values for statement and directive names
- *_COND Token values for textual condition names
- *_SEGMENT Token values for object definition segment names
- *_MK Token values for misc statement keywords
- *_TK Token values for "trace" directive keywords
- *_DK Token values for misc directive keywords
- *_SC Token values for system constant names
- (* is written in lower case)
- *_SYSF Token values for system function names
- [In all of the above eight cases, * is the name
- of the statement, keyword, etc. referred to,
- written in upper case except as specified above]
-
- *_SEP Token values for separators
- (the name sometimes reflects the text, e.g.,
- DARROW_SEP for the double-length arrow "-->";
- sometimes its use, e.g. NOTEQUAL_SEP for "~=")
-
- *_OP Token values for operators
- *_A Associativity values for operators
- *_U "Usage" (infix, prefix or postfix) values for
- operators
-
- *_T Symbol types
- *_SFLAG Symbol flags
- (bitmasks containing one bit set, so that
- (sflags[i] & *_SFLAG) is true if flag is set)
-
- *_CONTEXT Contexts in which the expression evaluator
- can run (e.g., "void", "condition")
-
- *_zc Internal numbers referring to Z-machine opcodes
- (* is the Standard 0.2 name for the opcode,
- written in lower case)
- *_OT Assembly operand types
- *_STYLE Two constants used to set whether the Z-machine
- has "time" or "score/moves" on its status line
- *_ZA Z-machine areas: e.g. PROP_ZA refers to the
- property values table area
- *_MV Marker values
- *_VR Veneer routines (* is the name of the routine,
- usually in mixed case)
-
- *_DBR Record types in debugging information files
-
- -------------------------------------------------------------------------
- Set of variables Used for
- -------------------------------------------------------------------------
- *_switch Flag indicating whether a command-line switch
- such as -s is on or off
- *_setting Numerical value set by a command-line switch
- such as -t3
- MAX_* A limit on something: note that a few of these
- are #define'd but most are memory setting
- variables
- no_* Number of things of this type made so far
- max_* Maximum number of things of this type made
- token_* Three variables used to hold the value, type
- and lexeme text for the last token read
- *_trace_level 0 if tracing information is not being printed
- out about *; otherwise, the larger this is,
- the more output is produced
- *_offset Byte offset in the Z-machine, either from the
- start of this Z-machine area or from the start
- of Z-machine memory
- *_top Pointer marking the current end in some uchar
- array (usually holding a Z-machine area being
- put together)
-
- -------------------------------------------------------------------------
- Set of routines Used for
- -------------------------------------------------------------------------
- init_*_vars() Routine in which section * of Inform initialises
- its variables to begin compilation
- *_begin_pass() ...and in which it initialises its variables
- at the start of the source code pass
- *_allocate_arrays() ...and in which is allocates any memory or
- arrays it needs to begin compilation
- *_free_arrays() ...and in which it deallocates any memory or
- arrays it has allocated, after compilation
- parse_*() Routine in the syntax analyser to parse the
- source code construct *
- assemble_*() Instructing the assembler to generate an
- instruction:
- assemble_#() (where # is a number from 0 to 4) an instruction
- with # operands
- assemble_#_to() an instruction with #operands which stores a
- result
- assemble_#_branch() an instruction with #operands which makes a
- conditional branch
- *_linenum() Keeping and writing line references to the
- debugging information file
-
-
- 1.4 Typedef-named types
- -------------------
-
- -------------------------------------------------------------------------
- Typedef name Where defined Used for
- -------------------------------------------------------------------------
- int32 H signed 32-bit integer
- uint32 H unsigned 32-bit integer
- uchar H unsigned char
-
- assembly_operand H holding Z-machine numbers in the form
- used in Z-code, together with linkage
- information about how they were
- calculated
- assembly_instruction H a convenient representation of an
- instruction of Z-code to assemble
- opcode asm.c everything about a Z-machine opcode:
- how many operands it has, whether
- branch or store, etc.
- verbl H grammar line of 8 token values
- verbt H grammar table of grammar lines
- prop H list of values for a property
- propt H property values table
- fpropt H the same, but with attributes too
- objectt H object tree-position and attributes
- dict_word H Z-text of a dictionary word
- dbgl H source code reference used for the
- debugging file (to file, line, char)
- keyword_group H the plain text of a group of keywords
- (such as: all the statement names)
- token_data H lexical tokens
- expression_tree_node H node in a parse tree produced by
- the section "expressp.c"
- operator H everything about an operator: its
- name, how to recognise it, its usage
- and associativity, etc.
-
- memory_block H an extensible area of memory
- (allocated in 8K chunks as required)
-
- tlb_s text.c used in abbreviations optimiser
- optab_s text.c used in abbreviations optimiser
-
- FileId H filename and handle for a source file
- ErrorPosition H filename and line reference for
- error message printing purposes
- LexicalBlock lexer.c name, line count, etc.
- within a block of text being lexed
- Sourcefile lexer.c buffer, pipeline and lexical block
- for a source code file being lexed
- ImportExport linker.c holds import/export records
- Marker linker.c holds marker records
- VeneerRoutine veneer.c holds low-level Inform source code
- for a veneer routine
- -------------------------------------------------------------------------
- "H" is an abbreviation here for "header.h"
-
-
- 2 Porting Inform to a new environment
- -----------------------------------
-
- 2.1 Dependence on the OS
- --------------------
-
- Strenuous efforts have been made over the last three years to make the source
- as operating-system independent as possible. As a general principle, mostly
- adhered to, all operating-system differences should be in the "header.h"
- file, and not in the 20 sections.
-
- As a general rule, for each target OS which Inform is being ported to, a new
- #define name is invented. For example, the name LINUX is used for the Linux
- port. When Inform is being compiled, only that one symbol will be defined,
- and none of the others: thus
-
- #ifdef LINUX
- ...code...
- #endif
-
- compiles the given code only for the Linux port.
-
- There are some very poor "ANSI C" compilers out there, and many more mediocre
- ones (which almost obey the standard, but don't quite): in any case the ANSI
- standard is very broadly defined. For example, the code
-
- int x;
- x = 45*1007;
- printf("%d\n", x);
-
- is entirely ANSI compliant, but results in different numbers being printed
- on different machines, due to the fact that ANSI does not specify the range
- of numbers which a variable of type int can safely hold. Since C is so highly
- unportable a language, and since some of the compilers used to produce Inform
- are poor, the whole Inform code has to be written with the worst possible
- compiler in mind.
-
- An illustration of this is that all preprocessor commands, such as #define,
- must begin on column 1 of the source code: even when they occur in code which
- is #ifdef'd out. VAX C (a particularly bad compiler) will reject
-
- #ifndef VAX
- #define FROG 2
- #endif
-
- for example, even when VAX is defined. This makes the declarations in
- "header.h" annoyingly illegible for everybody.
-
-
- 2.2 Portability issues: the types int32 and uchar
- ---------------------------------------------
-
- The main issues when porting Inform have been found to be:
-
- (a) the size of "int",
- (b) whether "char" is unsigned or signed by default,
- (c) what conventions apply to filenames,
- (d) assumptions about sizeof() when casting pointers,
- (e) how parameters (switches, filenames, etc.) are passed to Inform.
-
- (a) ANSI requires that "int" be at least 16 bit, though advances in CPU
- technology mean that under most of today's environments it will in fact be
- 32 bit. ANSI further requires that "long int" be at least as big as "int",
- but not that it has to be any bigger.
-
- Inform needs at least one integer type to be able to hold 32 bit signed
- values, and "header.h" contains code which attempts to typedef the name
- "int32" to such a type. This should happen automatically.
-
- Under ANSI rules, as the above says, a compiler need not have any integer
- type larger than 16 bits: if so, that compiler will not be able to
- compile Inform.
-
- An annoying issue here is that compilers vary widely in the extent to
- which they give errors or warnings when they detect silent "promotions"
- from one integer type to another. This makes it very hard to sort out
- the types of every object in the program between int and int32 so that
- everybody is happy: in practice, every time a new release of the source
- code has been made, a few dozen types have had to be fiddled with until
- everybody can compile it.
-
- (b) Compilers seem to divide about fifty-fifty on this. Again, the original
- standard is vague: the issue is really about how the value (char) 253,
- say, should be interpreted when it is cast to (int). Should the answer
- be 253, or -3? ANSI did not specify this because compiler writers
- wanted to be able to choose whichever could be done instantly on the
- CPUs they were working with. (If you store a char in the bottom 8 bits
- of a 32 bit register, then casting the value (char) 253 to (int) -3
- means setting all 24 of the upper bits, which requires code to be
- compiled.)
-
- Inform uses a typedef'd type called "uchar" when it needs an unsigned
- char type: it uses plain "char" when it doesn't mind. It never needs
- a signed char type.
-
- In theory ANSI C compilers must recognise the keywords "signed" and
- "unsigned", but some don't:
-
- typedef unsigned char uchar;
-
- actually produces an error on some compilers. So the typedef can only
- be made with your help. (On other compilers, "unsigned" is legal but
- "signed" is illegal.)
-
- (c) Many people think that the minimal 8.3 convention will work on any
- operating system, but this is not true (it won't work under Acorn's
- RISC OS). Much of each OS specification in "header.h" is therefore
- to do with filenaming.
-
- (d) For instance,
-
- sizeof(char *) sizeof(int *) sizeof(int32 *) sizeof(int)
-
- may all be different numbers on machines with segmented memory maps.
- This being so, casting between pointer types may lose information, and
- a few arrays in the source have surprising types to ensure safety.
-
- One thing Inform does need to be able to do is to subtract one pointer
- (of the same type) from another: it defines the macro
-
- subtract_pointers(X, Y)
-
- to do this. X and Y are normally of type uchar; there seems to have been
- no problem with this in practice.
-
- (e) The ANSI standard is quite good on the command line, and Inform expects
- to read parameters by the standard argc, argv mechanism. Unfortunately
- the Macintosh, for instance, has no orthodox command line. Such a port
- probably wants to have an "outer shell" which displays a window, allows
- options to be set and then calls the Inform 6 core as needed.
-
- The section "inform.c" normally compiles a "main" routine which makes
- a few machine-dependent changes and then passes its arguments straight
- on to "sub_main". For instance, here's the v6.10 source:
-
- int main(int argc, char **argv)
- { int rcode;
- #ifdef MAC_MPW
- InitCursorCtl((acurHandle)NULL); Show_Cursor(WATCH_CURSOR);
- #endif
- rcode = sub_main(argc, argv);
- #ifdef ARC_THROWBACK
- throwback_end();
- #endif
- return rcode;
- }
-
- The Macintosh Programmer's Workshop port is making multi-tasking
- work before embarking on compilation; the Acorn Desktop Debugging
- Environment port is tidying up after any error throwbacks, at the
- end of compilation. The point is that here is the place for such
- minor machine quirks.
-
- However, if you want an entirely new front end (such as Robert Pelak's
- Macintosh port of Inform has), then you need to define
-
- #define EXTERNAL_SHELL
-
- in your machine definition block (see later). This will mean that
- no "main" routine is compiled at all from "inform.c" (so you can
- simply link the Inform source into your own code, which will
- contain its own "main.c"): Inform should be run by calling
-
- extern int sub_main(int argc, char **argv);
-
- having set up argc and argv suitably. For instance, the outer shell
- might take names typed into dialogue boxes, and various ticked options
- on a window, and make these into a series of ICL commands, which
- are then handed over textually to sub_main. I suggest that the most
- efficient way to do this is to write them as an ICL file somewhere
- and to pass sub_main a single parameter telling it to run this
- ICL file.
-
-
- 2.3 The character set and the format of text files
- ----------------------------------------------
-
- The Inform source code assumes that the compiler is running on a machine
- whose character set agrees with ASCII in the range $20 to $7e. (This
- allows both plain ASCII and any of the ISO 8859 extensions to ASCII.)
-
- ASCII is now universal, but there is no common format for plain text files,
- and in particular how lines of text are ended. For example:
-
- MS-DOS, Windows, etc.: $0d $0a
- Mac OS: $0d
- RISC OS: $0a
-
- Inform 6 can read source code files in all these formats, and which further
- use any of the character sets above: plain ASCII or ISO 8859-1 to -9.
- (This is configurable using the -C switch.)
-
-
- 2.4 The OS definitions block in "header.h"
- --------------------------------------
-
- Each Inform port makes a block of definitions in the header file. These
- blocks take a standard format.
-
- Firstly, the block is put in #ifdef's so that it will only be processed
- in this one port. The block is divided into 6 sections, as follows.
-
- /* 1 */
-
- MACHINE_STRING should be set to the name of the machine or OS.
-
- /* 2 */
-
- Section 2 contains some miscellanous options, all of which are on/off:
- they are by default off unless defined. The possibilities are:
-
- USE_TEMPORARY_FILES - use scratch files for workspace, not memory,
- by default
- EXTERNAL_SHELL - this port is providing an entire external
- front end, with its own "main" routine:
- see above
- PROMPT_INPUT - prompt input: ignore argc and argv, instead
- asking for parameters at the keyboard.
- (I hope people will write front-ends
- rather than resort to this, but it may be
- a useful staging post.)
- TIME_UNAVAILABLE - if the ANSI library routines for working out
- today's date are not available
- CHAR_IS_SIGNED - if on your compiler the type "char" is signed
- by default
-
- Note that defining USE_TEMPORARY_FILES does not make a mandatory choice
- (as it did under Inform 5): whether to use allocated memory or temporary
- files is selectable with -F0 (files off) or -F1 (files on) in ICL.
- All that this option does is to define the default setting for this -F
- switch.
-
- Running -F0 is faster (possibly, depending on whether your C library
- provides buffering or not, much faster) but consumes 100 to 300K more
- memory (it does so flexibly, allocating only what it needs, unlike the
- Inform 5 option). Most users will not want to understand the issues
- involved here, so please make a sensible default choice for them.
-
- Once again, note that CHAR_IS_SIGNED must be defined if "char" is signed:
- otherwise "uchar" will be typedef'd wrongly.
-
- /* 3 */
-
- An estimate of the typical amount of memory likely to be free should be
- given in DEFAULT_MEMORY_SIZE. (This is only a default setting.)
-
- There are three settings: HUGE_SIZE, LARGE_SIZE and SMALL_SIZE.
- (I think it was Andrew Plotkin, though, who remarked that HUGE_SIZE
- might sensibly be renamed "not-bad-by-1980s-standards-size":
- these all allocate quite small amounts of memory compared to, say,
- the 8M of workspace that Windows appears to need just to keep breathing.)
-
- For most modern machines, LARGE_SIZE is the appropriate setting, but
- some older micros may benefit from SMALL_SIZE.
-
- /* 4 */
-
- This section specifies the filenaming conventions used by the host OS.
-
- It's assumed that the host OS has the concept of subdirectories and
- has "pathnames", that is, filenames giving a chain of subdirectories
- divided by the FN_SEP (filename separator) character: e.g. for Unix
- FN_SEP is defined below as '/' and a typical name is
-
- users/graham/jigsaw.z5
-
- Normally the comma ',' character is used to separate pathnames in a
- list of pathnames, but this can be overridden by defining FN_ALT
- as some other character. Obviously it should be a character which never
- occurs in normal pathnames.
-
- If FILE_EXTENSIONS is defined then the OS allows "file extensions" of
- 1 to 3 alphanumeric characters like ".txt" (for text files), ".z5"
- (for game files), etc., to indicate the file's type (and, crucially,
- regards the same filename but with different extensions -- e.g.,
- "frog.amp" and "frog.lil" -- as being different names).
-
- If FILE_EXTENSIONS is defined, then Inform uses the following standard
- set of extensions unless they are overridden by other definitions
- at this point. (Please don't override these definitions without reason.)
-
- Source_Extension ".inf" Source code file
- Include_Extension ".h" Include file (e.g. library file)
-
- Code_Extension ".z3" Version 3 story file
- V4Code_Extension ".z4" 4
- V5Code_Extension ".z5" 5
- V6Code_Extension ".z6" 6
- V7Code_Extension ".z7" 7
- V8Code_Extension ".z8" 8
-
- Module_Extension ".m5" Linkable module file (version 5,
- which is all that Inform 6 supports yet)
-
- ICL_Extension ".icl" ICL file
-
- The debugging information file and the transcript file also have defined
- default-names which can be over-ridden in this section if desired:
-
- Transcript_File "gametext.txt" or "gametext"
- Debugging_File "gameinfo.dbg" or "gamedebug"
-
- If you do not define FILE_EXTENSIONS, then it is essential to define
- STANDARD_DIRECTORIES instead. (You can also define both, if you so
- choose.)
-
- The STANDARD_DIRECTORIES option causes Inform to put all files of
- a particular kind into a standard directory for them: e.g., a "games"
- directory might hold the story files compiled, etc. All that happens
- when a standard directory is defined is that Inform sets the default
- value of the relevant pathname variable to that standard directory:
- otherwise, its pathname variable starts out as "".
-
- The standard directories are, once again, defined by default as follows:
- once again you can define these settings yourself, but please don't
- do so without a good reason.
-
- Source_Directory "source"
- Include_Directory "library"
- Code_Directory "games"
- Module_Directory "modules"
- Temporary_Directory ""
- ICL_Directory ""
-
- Note that the actual user of Inform can still override anything you
- choose by setting the pathname with an ICL command.
-
- A good way to test all this is to run inform -h1, which does some
- experimental filename translations and prints the outcome.
-
- /* 5 */
-
- Section 5 contains information on how to choose the filenames for the
- three temporary files. (Note that this needs to be done even if
- USE_TEMPORARY_FILES is not defined.)
-
- On many machines, you only need to give a suitable name. (As usual,
- if you don't bother, something fairly sensible happens.)
-
- Temporary_Name is the body of a filename to use (if you don't set this,
- it becomes "Inftemp") and Temporary_Directory is the directory path for
- the files to go in (which can be altered with an ICL command).
-
- However, under some multi-tasking OSs it is desirable for multiple Inform
- tasks to work simultaneously without clashes, and this means giving
- the temporary files filenames which include some number uniquely
- identifying the task which is running. If you want to provide this,
- define INCLUDE_TASK_ID and provide some code...
-
- #define INCLUDE_TASK_ID
- #ifdef INFORM_FILE
- static int32 unique_task_id(void)
- { ...some code returning your task ID...
- }
- #endif
-
- /* 6 */
-
- Finally, section 6 is "anything else". In particular this is where
- DEFAULT_ERROR_FORMAT should be set. This switches between different
- styles of error message. (This is not a matter of aesthetics: some
- error-throwback debugging tools are very fussy about what format
- error messages are printed out in.)
-
- For example, here is a typical OS definition block:
-
- #ifdef UNIX
- /* 1 */
- #define MACHINE_STRING "Unix"
- /* 2 */
- #define CHAR_IS_SIGNED
- /* 3 */
- #define DEFAULT_MEMORY_SIZE LARGE_SIZE
- /* 4 */
- #define FN_SEP '/'
- #define FILE_EXTENSIONS
- /* 5 */
- #define Temporary_Directory "/tmp"
- #define INCLUDE_TASK_ID
- #ifdef INFORM_FILE
- static int32 unique_task_id(void)
- { return (int32)getpid();
- }
- #endif
- #endif
-
-
- 2.5 Running Inform in a multi-tasking OS
- ------------------------------------
-
- As mentioned above, if Inform is being used in a multi-tasking environment
- then temporary file-naming will need a little attention.
-
- Another issue is that under some systems the other tasks may all freeze
- up while Inform is working, because tasks only voluntarily hand control
- back to the OS (allowing it to poll the other tasks and share out the
- processor time). This means that some call to an OS primitive routine
- may have to be inserted into Inform somewhere: a good place to do this
- is in the routine reached_new_line() of the section "lexer.c".
-
-
- 3 The front end
- -------------
-
- 3.1 The ICL interpreter
- -------------------
-
- Inform's front end is quite simple and there is little to say about it.
- Its task is to translate the user's compilation command into five kinds of
- ICL variable:
-
- switches, which are on/off flags controlling compilation options;
- switch settings, which are numerical values (between about 1 and 8)
- controlling more finely-controllable compilation options;
- path variables, which hold the text of some directory pathname;
- memory settings, which hold positive numbers (the extent of certain
- memory allocations within Inform);
- the filenames of the source code to read, and the object code to write.
-
- See "inform.c" and the memory-setting-parser in "memory.c" for the details.
-
-
- 3.2 Fundamental method
- ------------------
-
- It is not possible, in almost any compiled language, to make a direct
- translation from source to object code "a line at a time": the first time
- a line is reached, it often cannot be finally translated until information
- is known which cannot yet be known. (For example, Inform translates an
- object's name into a number indicating its position in the final game's
- tree of objects. If the name comes up before the definition of the object
- has been seen, Inform cannot know what number to translate the name into.)
-
- Inform makes only one pass through the entire source code. The translation
- it makes is (roughly speaking) left blank in places, and these blanks are
- filled in at the very end, once the necessary information is available.
- This process is called "backpatching": Inform is patching things up behind
- itself.
-
-
- 4 Lexical analysis
- ----------------
-
- 4.1 Aim and structure of the lexer
- ------------------------------
-
- The task of the lexical analyser, or "lexer" for short, is to translate the
- text it reads into a sequence of "tokens". The aim is that the rest of the
- compiler should work with the tokens and need never look at the original text
- again; Inform mainly achieves this aim.
-
- Each token represents a sequence of characters, called a "lexeme", which is
- indivisible from a syntactic point of view. For instance, the text
-
- $12+duck.&feathers
-
- contains five atomic pieces of syntax:
-
- Token Lexeme
-
- <the number 18> $12
- <the separator +> +
- <symbol number 1> duck
- <the separator .&> .&
- <symbol number 2> feathers
-
-
- The Inform lexer has three, or perhaps four levels:
-
- Level 1: reading in text, filtering out control characters, etc.
-
- Level 2: a context-free tokeniser
-
- Level 3: a context-dependent process of deciding whether any
- identifier names found are keywords or names made up
- by the programmer
-
-
- To make a fairly revolting analogy: when parsing dinner, lexical analysis
- is done with the teeth, and syntax analysis with the stomach. But with
- programming languages it is difficult to give a very precise point at
- which one ends and the other begins: and in Inform, the "perhaps level 4"
- of the lexer occurs when tokens are sorted out more finely in the expression
- parser (is this an operator? is it a constant value? is it a variable name?),
- which will be documented as part of the syntax analyser in this manual.
- At any rate, "lexer.c" and "symbols.c" contain levels 1 to 3 only.
-
- For details of some standard algorithms in compiler theory, the account below
- and in section 5 gives page references to "ASU": this is Aho, Sethi and
- Ullman (1986), "Compilers: Principles, Techniques and Tools", the so-called
- "Dragon book".
-
- It's called this because the front cover features a knight armoured in
- various algorithms confronting a dragon labelled "Complexity of compiler
- design": in the case of Inform, though, the dragon might reasonably be
- twice as big and labelled "Ignorance of designer who originally chose the
- syntax for Inform". Compiler-making tools like "lex", "yacc" and "bison"
- are difficult to apply to this language, since Inform doesn't have a
- convenient LALR(1) grammar (more like LALR(4), in fact). In any case the
- lexical and syntax analysers are hand-coded for speed, which is generally
- considered to produce code twice as fast as that generated by mechanical
- tools like "yacc".
-
-
- 4.2 Level 1: lexical blocks and buffered input
- ------------------------------------------
-
- The lexer can take input from two sources: from the source code files which
- are being compiled, or from a null-terminated string somewhere in the
- compiler. (The latter is used when compiling the "veneer" of run-time
- routines.)
-
- A "lexical block" is a piece of text which is individually named and
- line-counted (so that error messages can indicate where an error has occurred
- with reference to the original source). Each single file of source code
- (the original file and each Include file) is a lexical block in its own
- right. Likewise, if the lexer is reading from an internal string, then
- that string is a lexical block.
-
- The LexicalBlock structure holds data about the position inside such a block.
- Note that several may be needed at once: if one source file includes another,
- which includes another, then three different LexicalBlocks will contain
- useful information.
-
- However, information about the files currently being worked from is stored
- in a different structure way, and not in LexicalBlock structures. For
- efficiency reasons, we can't read characters out of the files one at a time:
- instead we read them in 4096-byte chunks into a buffer. Note that this
- means that the read-position of a file will be well ahead of the
- read-position of the lexer within it. It may have been closed before the
- lexer is halfway through its buffer.
-
- A further complication is that it's natural to use a stack structure to hold
- the files being read from:
-
- include file 2 <-- reading from this
- include file 1 <-- still open but not being read from
- main source code file <-- still open but not being read from
-
- and it is also natural to use a stack structure to hold the LexicalBlocks
- being worked through:
-
- LB for include 2
- LB for include 1
- LB for main source code file
-
- Despite the apparent similarity here, we can't combine this into one stack,
- since "include file 2" will have been pushed off the files stack before its
- LB is pushed off the LB stack. Since further Include directives may occur
- at any point in the current LB, the stacks can be very different.
-
-
- Otherwise level 1 is simple and fast. Note that characters are fed up into
- level 2 through a "pipeline" (see the next section for what this means and
- why), and are also filtered. Inform can read source files in any plain
- ASCII or any of the ISO 8859 standard ASCII extensions (five variants with
- accented Latin characters, plus Cyrillic, Arabic, Hebrew, Greek). The
- filtering process removes any character codes undefined in the current
- ISO setting, and normalises new-line and spacing characters:
-
- 00 remains 0 (meaning "end of file")
- TAB becomes SPACE
- 0d becomes 0a, i.e., '\n'
- other control characters become '?'
- 7f becomes '?'
- 80 to 9f become '?'
- a0 (ISO "non-breaking space") becomes SPACE
- ad (ISO "soft hyphen") becomes '-'
- any character undefined in ISO between a0 and ff is mapped to '?'
-
- (Here "ISO" means "the current ISO setting", which can be 0, meaning
- plain ASCII only: in this mode any character value of 80 to ff will
- be filtered to '?'.)
-
- The filtering ensures that (i) no error message can contain an unprintable
- character and (ii) the user cannot type a character outside a standard
- ISO set and which might work on one platform but not on another.
-
- Note that the 0 character is only fed upwards into level 2 when the entire
- lexical source has run out: that is, all the source files are exhausted,
- or else the string being analysed has run out.
-
-
- 4.3 Level 2: breaking text into lexemes
- -----------------------------------
-
- The input to level 2, then, is a stream of characters: and its output is a
- stream of tokens, each of which represents a sequence of characters called a
- "lexeme".
-
- Some definitions of terms used in the Inform source code. There is a fixed
- set of "separators", all of them sequences of 1 to 3 mainly non-alphabetic
- characters:
-
- -> --> -- - ++ + * / %
- || | && & ~~
- ~= ~ == = >= >
- <= < ( ) ,
- .& .# ..& ..# .. .
- :: : @ ; [ ] { }
- $ ?~ ?
- #a$ #n$ #r$ #w$ ## #
-
- An "identifier" is a sequence of one or more characters from the set:
-
- A B C D ... Z a b c ... z 0 1 2 3 4 5 6 7 8 9 _
-
- which does not begin with 0 to 9. The lexer makes the longest lexeme it
- possibly can for any particular token, so the next character after any
- identifier will not be one of the set above.
-
- The lexemes representing numbers are:
-
- (a) one or more chars from the set 0 1 2 3 4 5 6 7 8 9
- (b) $ followed by one or more chars from 0 1 2 3 4 5 6 7 8 9 A B C D E F
- (hexadecimal numbers)
- (b) $$ followed by one or more chars from 0 1
- (binary numbers)
-
- Note that "-67" is not a lexeme for a single token: it is broken into two
- lexemes, "-" and "67".
-
- A ", followed by any number of characters and then another ", is a single
- lexeme. Similarly for '.
-
- Finally, as a special case, the six separators
-
- #a$ #n$ #r$ #w$ ## #
-
- are expected to be followed by an identifier. For example, "##Take" is
- a single lexeme, and is not divided into "##" and "Take". (Except that
- #n$ can be followed by any character, so that e.g. #n$1 is a single token,
- representing "the new dictionary word '1'" in the language.)
-
- To sum up, the lexical analyser expects legal source code to contain:
-
- "comments": sequences beginning with a ! character and ending with a
- new-line or the end of the file or string containing it
-
- any of the lexemes above
-
- "white space", that is, characters in the set
- space tab new-line
-
-
- When breaking down text into lexemes (and throwing away the white space
- and comments), the lexer needs 3 characters of look-ahead. That is, it can
- decide definitely what lexeme character N belongs to provided that it knows
- what characters N+1, N+2 and N+3 are. (The figure 3 occurs because that's
- the worst case arising: deciding whether character N is the start of a
- separator in the form "#a$" followed by an identifier character.)
-
- Because of this, level 1 maintains a "pipeline" of four variables to hold the
- current character and the next three on the way. By looking ahead at the
- pipeline as needed, level 2 decides what to do with the current character and
- then advances one character: the three waiting in the pipeline shuffle up and
- a new character is read into the last place in the pipeline. The advantage
- of maintaining a pipeline is that the lexer never needs to undo any decisions
- and go backward through the text.
-
-
- One token is output for each lexeme found, and when the lexer runs out of
- input altogether (when all the source files are finished, or when the string
- being analysed has run out) a special token, called "end of file", is output.
- Thus the lexer's output is a sequence of tokens from the list:
-
- numbers
- strings in " quotes
- strings in ' quotes
- separators
- identifiers
- end-of-file
-
- As will be seen in the next section, identifiers are analysed further before
- being passed out of the lexer.
-
- Except for the problem of deciding what an identifier means, the lexer is
- "context-free": it needs to know nothing about the syntax of the Inform
- language, what kind of code it is currently analysing, etc. There is a
- standard state-machine algorithm for writing such a lexer: see ASU, p. 98.
- The tricky part is to efficiently store a transition table for the states
- involved, which will be neither so small that slow code is required to
- read it, nor so large that it takes up an unacceptable amount of memory.
- Here the "tokeniser_grid" stores most of a transition table: this is an
- algorithm originally suggested to me by Dilip Sequeira, similar to that used
- by S. C. Johnson's "yacc" lexical analyser.
-
-
- 4.4 Representation of tokens
- ------------------------
-
- Tokens are internally stored as quadruples:
-
- typedef struct token_data_s
- { char *text;
- int value;
- int type;
- int marker;
- } token_data;
-
- though the lexer is not responsible for writing "marker" values, which are
- the concern of the syntax analyser. The "type" identifies which kind of
- token it is (this value must be one of the *_TT set #define'd in "header.h"):
- for example, DQ_TT means "a double-quoted string". The "value" is a
- number whose meaning depends on the type. For example, in the case of
- type DQ_TT, the number has no meaning and is not used; in the case of type
- SEP_TT (for "separator"), the number gives the index in the above table
- of possible separators, thus identifying which separator it is.
-
- "text" is a pointer to a null-terminated string containing the lexeme. There
- are two reasons this is needed: firstly, so that comprehensible error
- messages can be printed out from higher up in the compiler, and secondly
- because in the case of DQ_TT and SQ_TT the text may need to be compiled
- into Z-text format and added to the static strings area, to the Z-code area
- or to the dictionary. The decision on whether the string should be compiled
- and if so, what to, cannot be taken until code generation time.
-
- Unfortunately code generation time may not be for a while, and this raises a
- problem: we clearly need to keep lexeme text in memory for a while, but it
- would be very costly on memory to keep the text of every token in memory
- permanently. When is it safe for the lexer to throw a lexeme away?
-
- One solution would be for a formal system by which the code generator
- explicitly deallocated the space, but this is too bureaucratic and too risky
- (suppose we miss 0.1% of these? Memory will slowly clog up and begin to cause
- odd errors on large compilation runs). Instead, the lexer simply writes its
- text cyclically into a large buffer. Code generation almost always takes
- place before the lexer has got any more than 100 or so bytes of text ahead;
- since the buffer is a good 10K long, there is a generous safety margin.
-
-
- 4.5 Level 3: identifying identifiers
- --------------------------------
-
- A "keyword" is an identifier which has a meaning understood by the Inform
- compiler: for example the lexeme
-
- while
-
- is understood (in some contexts) to refer to the "while" statement, rather
- than being a name coined by the programmer for some variable or object.
-
- In most small languages such a lexeme would always be a keyword and never a
- name. (This is the concept of "reserved word": a word reserved for the use
- of the compiler, forbidden as a name.) For example, the fairly-large language
- C++ has 48 such keywords.
-
- Inform is unable to reserve keywords (that is, forbid their use as names) for
- two reasons: firstly, because there are 281 of the damn things (and 116 of
- them are only significant in lines of assembly language: why forbid the use
- of these to non-assembly-language programmers?), and secondly because it is
- important to the language that some keywords definitely should be the same as
- some object names. "Class" is both a keyword meaning "the class-making
- directive" and a name meaning "the object representing the class of all
- classes".
-
- So the decision "is this identifier intended to be a keyword or intended
- to be the name of something?" relies on the context. What the lexer does
- is to divide the 281 keywords into twelve groups. The syntax analyser
- switches these groups on or off as it finds its way through the program.
- For example, the group "opcode_names" is only enabled (i.e., switched on)
- after an @ character has been read at the beginning of a statement.
-
- A complication is that case is sensitive for some of these groups and not
- others. For example, "IFDEF and "Ifdef" both match the directive name
- "ifdef", but "WHILE" and "While" do not match the statement name "while".
- The rules in fact are:
-
- When matching... Example Sensitive?
-
- Language keywords:
- directive names Class No
- keywords used within directives with Yes
- statement names while Yes
- keywords used within statements else Yes
- condition names hasnt Yes
- built-in function names random Yes
- built-in constant names #code_offset Yes
-
- Assembly language keywords:
- opcode names put_prop Yes
- customised opcode names @"1OP:4S" Yes
- the stack pointer name sp Yes
-
- Names for variables, objects, etc. lantern No
- Names for actions Take No
-
- Moreover, in contexts where local variables are allowed, they take
- precedence over the rest. (For example, you could define a local variable
- called "random", and for the duration of that routine the word "random"
- would never be recognised as the keyword for the system function random().)
-
- Each keyword group has its own token type. The token value for a keyword
- token is the index within the group: see "header.h" for #define's for
- all of these indices.
-
- Finally, there are times when the syntax analyser just wants the raw text
- of an identifier, and doesn't want it identified (for efficiency reasons:
- this prevents it from being entered into the symbols table). If the
- dont_enter_into_symbols_table flag is set, the lexer simply returns a
- token of type DQ_TT as though the name had been found in double-quotes.
-
-
- 4.6 Hash-coding and comparing names
- -------------------------------
-
- It would be computationally very expensive indeed to decide whether string X
- is a keyword or not by carrying out direct string comparisons with all 281
- keywords. Instead, hash-coding is used. The idea is to perform a fast
- operation on X whose result is some number from, say, 1 to 100: this is the
- hash code of X. If X is the same string as K, then K must have the same hash
- code. So we organise the keywords into 100 different groups, the group with
- hash code 1, with hash code 2, etc.: then if h is the hash code of X, we only
- need to compare X with the keywords in group h.
-
- For this to be any use, the operation performed has to be chosen so that
- the keywords are sorted out into groups with about equal numbers in each.
- Many sensible "hash functions" are known (see ASU, p. 437 for experimental
- performance data). Inform's choice is simple but effective: it uses
- multiplication rather than bit-shifting, but then on modern CPUs
- multiplication is not very expensive. The algorithm is what ASU would
- call "X 30011":
-
- extern int hash_code_from_string(char *p)
- { unsigned int32 hashcode=0;
- for (; *p; p++) hashcode=hashcode*30011 + case_conversion_grid[*p];
- return (int) (hashcode % HASH_TAB_SIZE);
- }
-
- where "hashcode" is meant to overflow repeatedly. Note that 30011 is a prime
- number. (case_conversion_grid[*p] is just a fast way to convert all lower
- case letters to upper case ones, since we want case to be insignificant when
- calculating hash codes.) It doesn't matter if this algorithm behaves
- differently on different machines (depending on how the CPU copes with
- overflows): all that matters is it delivers a reasonable spread of hash
- values.
-
- The hash code range is 0 to HASH_TAB_SIZE-1: this is a memory setting
- which by default is 512. Reducing this by any substantial amount causes a
- significant slowing down of Inform.
-
-
- 4.7 The symbols table
- -----------------
-
- If an identifier, in context, isn't a keyword then it is called a "symbol":
- this basically means "a name for something created by the programmer"
- (except that Inform creates a few symbols automatically as well).
-
- The section "symbols.c" keeps a table of all the symbols which have turned
- up so far. (It often picks up a few extraneous names of keywords, too,
- when the lexer has been looking ahead in the wrong context: but this doesn't
- matter and it's faster to tolerate the slight waste of memory.)
-
- The symbols table is organised into HASH_TAB_SIZE-1 linked lists, each of
- which is kept in alphabetical order. A single operation is provided for
- searching it: symbol_index(), which works out the hash code of the string S
- supplied (unless it's already known, in which case it need not work it out
- again) and looks through the linked list for that hash code until the
- first letter matches. (Thus, the only full string comparisons made are
- between S and those strings in the table which have the same hash code
- and the same initial letter.) If S is not present, it is inserted into
- the table.
-
- In addition to its text, each symbol in the table has four pieces of data
- attached: its type, its value, the line on which it was assigned and its
- flags. The symbol with index i has:
-
- symbs[i] text
- stypes[i] type char
- svals[i] value int32
- sflags[i] flags word int
- slines[i] line number int32
-
- (It would be more elegant to store this as an array of structures, but
- the symbols table is a major consumer of memory, so we can't afford to let
- lazy compilers waste it by aligning the fields of such a structure
- at 4 byte intervals. Hence, five separate arrays.)
-
- Types and values are assigned by the syntax analyser: the type indicates
- what kind of thing the symbol is a name of (a routine, a label, an object,
- etc.), and must always be one of the #define names *_T (see "header.h"). The
- value holds some number indicating which particular thing the symbol is a
- name of (a routine address, an object number, etc.).
-
- The line number is set when the symbol is first used, and then reset when it
- is first assigned a value (if ever). The information is stored as
-
- file-number*0x10000 + line-number
-
- where file-number is an index into the InputFiles array maintained by
- "files.c".
-
- The flags word holds (presently) 15 different flags, for different properties
- that the symbol name can have. These flags have #define names in the form
- *_SFLAG.
-
- The token for a symbol has type SYMBOL_TT and value equal to the index i of
- the symbol within the symbols table.
-
-
- 4.8 Token lookahead: the circle
- ---------------------------
-
- The output of the lexer is sent, one token at a time, when the syntax
- analyser calls get_next_token().
-
- The syntax analyser is a predictive parser which makes mistakes and has to
- backtrack (that is, it reads in token X, guesses that it might be looking at
- XY and reads the next token to see... but reads a Z instead, and has to go
- back to X and try a different theory). So the syntax analyser needs to be
- able to go backwards, as well as forwards, in the stream of tokens.
-
- This means that the lexer needs to remember its last N tokens, where N is
- the maximum number of backward steps the syntax analyser ever takes in one
- go. The number N is the amount of "lookahead", since one could also think
- of it as the maximum tokens one needs to look ahead of the current position
- to be sure of what is happening.
-
- These tokens are therefore stored in a "circle" of N+1 tokens: when
- get_next_token() is called, the lexer works out the next token, writes
- it into the circle and moves on one place clockwise; when put_token_back() is
- called, the lexer moves back one place anticlockwise, so that the next
- get_next_token() will send back the token in that position rather than work
- out a new one.
-
- However, the interpretation of identifiers as keyword or symbol depends on
- the current context, and this may have changed since the token was first
- calculated. Therefore a numerical representation of the context is stored
- in the circle too, so that if this has changed and if the token is an
- identifier then it is re-identified under the context which now prevails.
-
-
- 4.9 Summary of the lexer's output
- -----------------------------
-
- To summarise, the lexer converts source text to a stream of tokens of types
- as follows:
-
- --------------------------------------------------------------------------
- Type Lexeme Value
- --------------------------------------------------------------------------
- NUMBER_TT literal number in the number
- decimal, hex or binary
- DQ_TT string extracted from (none)
- double-quotes
- (or identifier name:
- see above for when)
- SQ_TT string extracted from (none)
- single-quotes
- SEP_TT separator a #define'd *_SEP value
- EOF_TT (end of source) (none)
-
- SYMBOL_TT identifier assumed index in the symbols
- to be a name table
-
- LOCAL_VARIABLE_TT local variable name 1 to 15, its Z-machine
- variable number
- STATEMENT_TT statement name a #defined *_CODE value
- SEGMENT_MARKER_TT with/has/class etc. a #defined *_SEGMENT value
- DIRECTIVE_TT directive name a #defined *_CODE value
- CND_TT named conditional a #defined *_COND value
- SYSFUN_TT built-in function a #defined *_SYSFUN value
- OPCODE_NAME_TT opcode name a #defined *_zc value
- (i.e., an internal
- opcode number)
- MISC_KEYWORD_TT keyword like "char" used a #defined *_MK value
- in statement syntax
- DIR_KEYWORD_TT keyword like "meta" used a #defined *_DK value
- in directive syntax
- TRACE_KEYWORD_TT keyword used in syntax a #defined *_TK value
- of "trace" directive
- SYSTEM_CONSTANT_TT system #constant name a #defined *_SC value
- --------------------------------------------------------------------------
-
- Note that other token types do exist, but are set in "level 4": i.e., within
- the expression parser.
-
-
- 5 Syntax analysis
- ---------------
-
- 5.1 Aim and structure of the syntax analyser
- ----------------------------------------
-
- The syntax analyser is the heart of Inform, and contains within it the
- specification of the language.
-
- Inform code fundamentally consists of directives, which are instructions to
- the compiler to do something, and routines containing statements, which are
- lines of code to be executed at run-time.
-
- The syntax analyser takes as input the stream of tokens from the lexer. Some
- of these are interpreted as directives and acted on immediately. Others
- are realised to be statements and compiled; we can regard the output of the
- compiling part of the analyser as being a stream of Z-machine assembly
- language, passed down into "asm.c".
-
- In most modern compilers, the syntax analyser would convert the token
- stream into a "parse tree" expressing the structure of the input: something
- along the lines of
-
- routine
- / \
- statement statement
- | |
- "if" ...etc., etc.
- / \
- condition statement
- | |
- "==" "print"
- / \ |
- "x" "0" "Good heavens!"
-
- This is aesthetic, and makes possible many optimisations (such as elimination
- of common sub-expressions), since the compiler is able to see the whole
- structure before beginning to compile it. But it uses up a fair amount of
- memory (admittedly, most compilers only keep the tree for the current
- routine, not the tree for the entire program); and a lot of speed, as the
- tree is trawled through again and again.
-
- Characteristically, Inform instead uses an algorithm which is about 40 years
- old (see ASU, p.181 et seq). It doesn't generate a parse tree for whole
- routines, though it does do so for expressions, assignments and conditions:
- for example, it would have built the little tree:
-
- "=="
- / \
- "x" "0"
-
- For higher level-parsing, it works "top-down", making recursive function
- calls which mimic a depth-first traversal of the tree drawn above. That is,
-
- routine()
- calls statement()
- which reads the token "if"
- and calls condition()
- which works out the expression "x == 0"
- and compiles some code to perform this test and
- then to skip the next instruction if it's false
- and calls statement()
- which reads the token "print"
- and then the token "Good heavens!"
- and compiles some code to print this
- and routine() next calls statement()
- which reads... etc., etc.
-
- Although we are only able to go through the tree once, it's a quick and
- efficient trip, effectively using the C run-time stack to hold the parse
- tree structure. The result is also a rather legible piece of source
- code (as compared, for example, with the output from "yacc"). (The main
- reason we can escape from having to make parse trees is that our target
- machine, the Z-machine, has an architecture making register allocation
- unnecessary: the variables are the registers, and there's no space or
- time penalty for use of the stack.)
-
- The syntax analyser begins with parse_program() in the section "syntax.c".
- Because its structure is embodied in the source code, there is relatively
- little to say about it except to refer the reader to that: and to the
- the table in section 1.1 above.
-
-
- 5.2 Predictive parsing
- ------------------
-
- Note that tokens can be read from many different places, and code can be
- passed out from many different places: whereas the lexer and the assembler
- each have a single channel of input and output.
-
- As a typical example of the syntax analyser working as a "predictive parser"
- and having to "backtrack", consider how parse_action() begins to parse through
- action statements like
-
- <Take fishknife>; tokenised as <
- symbol "Take"
- symbol "fishknife"
- >
-
- When it's called, the first token, "<" has already been read (this is how
- parse_statement() knew to call parse_action() in the first place).
-
- What it does is:
-
- "Predict" that it's going to be a << ... >> statement
- Ask the lexer for a token, expecting it to be another "<"
- Discover that, unfortunately, it's a symbol, so the prediction was wrong
- Backtrack to the point where it made the wrong prediction
- (this actually only means giving the token "Take" back into
- the lexer again)
- Now predict the next possibility, that it's a < ... > statement
- Ask the lexer for a token, expecting it to be a symbol name
- Discover that this time it is ("Take"), so the prediction was correct
- ... and so on.
-
- The code to do this is more like:
-
- get_next_token();
- if (token == "<") return parse_double_angle_action();
- put_token_back();
- return parse_single_angle_action();
-
- (The actual code doesn't make these inefficient function calls, and token
- comparison is a bit fiddlier, but this is the idea.)
-
-
- Clearly, the longer such a prediction goes on before it is found to be
- wrong, the more tokens which the lexer has to take back again. The syntax
- of the language determines this number, N: for many languages N = 1 (because
- they've been designed that way) but for Inform N = 5 (because it was defined
- by an ignorant oaf). (Though, as will appear in the next section, it would
- be possible to reduce this by implementing the grammar differently: and
- there are compensations, such as the relative conciseness of Inform code.)
-
-
-
- 5.3 The context-free grammar
- ------------------------
-
- Here are the productions which the top-down syntax analyser implements.
- (Most of the major ones are handled by routines.)
-
- "void_expression", "constant", "condition" and "quantity" are left undefined:
- these are handled by the expression parser according to an operator-precedence
- grammar. "vcondition" is a condition whose first token is a VARIABLENAME.
- Symbols are represented in capitals; NEWNAME means one not so far assigned
- a value, while CLASSNAME, etc., mean an existing symbol of the given type.
- Obsolete features which are still supported are omitted.
-
- Details of conditional compilation are abbreviated heavily: the notation
- PROGRAM means "anything at all until the right directive keyword comes up
- at the same #if... level".
-
- =========================================================================
- program -> directive ; program
-
- directive -> [ NEWNAME routine
- "Abbreviate" strings
- "Array" NEWNAME arraytype array
- "Attribute" NEWNAME
- "Attribute" NEWNAME "alias" ATTRIBUTENAME
- "Class" NEWNAME objbody
- "Class" NEWNAME ( constant ) objbody
- "Constant" NEWNAME
- "Constant" NEWNAME constant
- "Constant" NEWNAME = constant
- "Default" NEWNAME constant
- "End"
- "Extend" extension
- "Fake_action" NEWNAME
- "Global" NEWNAME
- "Global" NEWNAME = constant
- "Ifdef" NAME condcomp
- "Ifndef" NAME condcomp
- "Iftrue" constant condcomp
- "Iffalse" constant condcomp
- "Ifv3" constant condcomp
- "Ifv5" constant condcomp
- "Import" manifest
- "Include" STRING
- "Link" STRING
- "Lowstring" NEWNAME STRING
- "Message" diagnostic
- "Nearby" objhead objbody
- "Object" arrows objhead objbody
- "Property" NEWNAME propdef
- "Release" quantity
- "Replace" ROUTINENAME
- "Serial" STRING
- "Statusline" "score"
- "Statusline" "time"
- "Stub" NAME quantity
- "Switches" TEXT <---- An unquoted string
- "System_file"
- "Trace" trace
- "Verb" verb
- "Zcharacter" zchar
- CLASSNAME arrows objhead objbody
-
- condcomp -> ; PROGRAM ; "Endif"
- ; PROGRAM ; "Ifnot" ; PROGRAM ; "Endif"
-
- manifest -> import "," manifest
- import
- import -> "global" NEWNAME
-
- diagnostic -> STRING
- "error" STRING
- "fatalerror" STRING
- "warning" STRING
-
- propdef -> propdefault
- "additive" propdefault
- propdefault -> <empty>
- quantity
- =========================================================================
- trace -> t_section t_level Tracing
- t_section -> <empty>
- "objects"
- "symbols"
- "verbs"
- "assembly"
- "expressions"
- "lines"
- t_level -> <empty>
- "on"
- "off"
- quantity
- =========================================================================
- zchar -> char_spec Character set
- STRING STRING STRING
- "table" char_specs
- "table" "+" char_specs
- char_specs -> char_spec char_specs
- char_spec
- char_spec -> LITERAL_CHARACTER
- =========================================================================
- arrows -> "->" arrows Object definition
- <empty>
- objhead -> <empty>
- NEWNAME
- STRING
- NEWNAME STRING
- NEWNAME OBJECTNAME
- STRING OBJECTNAME
- NEWNAME STRING OBJECTNAME
- objbody -> segment objbody
- segment "," objbody
- segment -> "class" class_s
- "with" with_s
- "private" with_s
- "has" has_s
- class_s -> CLASSNAME class_s
- with_s -> PROPERTYNAME property
- PROPERTYNAME property "," with_s
- has_s -> attributes
- property -> <empty>
- [ routine
- arrayvals
- =========================================================================
- arraytype -> -> Arrays
- -->
- "string"
- "table"
- array -> constant
- STRING
- arrayvals
- [ manyvals ]
- manyvals -> constant manyvals
- constant ; manyvals
- arrayvals -> constant arrayvals
- =========================================================================
- extension -> "only" strings priority grammar Grammar
- STRING priority grammar
- priority -> <empty>
- "replace"
- "first"
- "last"
-
- verb -> strings v_setting
- "meta" strings v_setting
- v_setting -> = STRING
- grammar
-
- grammar -> g_line grammar
- g_line -> * tokens -> g_action
- g_action -> ACTIONNAME
- ACTIONNAME "reverse"
-
- tokens -> g_token tokens
- g_token -> preposition
- "noun"
- "held"
- "multi"
- "multiheld"
- "multiexcept"
- "multiinside"
- "creature"
- "special"
- "number"
- "topic"
- "noun" = ROUTINENAME
- "scope" = ROUTINENAME
- ATTRIBUTENAME
- ROUTINENAME
-
- preposition -> DICTIONARYWORD
- DICTIONARYWORD / preposition
-
- strings -> STRING strings
-
- attributes -> attribute attributes
- attribute -> ATTRIBUTENAME
- ~ ATTRIBUTENAME
- =========================================================================
- routine -> * locals ; body ] Everything below
- locals ; body ] here is code to
- locals -> NAME locals compile
- <empty>
-
- body -> action_case : body
- "default" : body
- statement body
- action_case -> ACTIONNAME
- ACTIONNAME , action_case
-
- block -> ;
- statement ;
- { states }
- { states . NEWNAME ; }
- states -> statement states;
-
- s_block -> { s_states }
- { s_states . NEWNAME ; }
- s_states -> case : s_states
- "default" : s_states
- statement s_states
- case -> range :
- range , case
- range -> constant
- constant "to" constant
- =========================================================================
- statement -> . NEWNAME ; statement
- "@" assembly ;
- "<" action ">" ;
- "<<" action ">>" ;
- indiv_state
- STRING
- void_expression
-
- action -> ACTIONNAME
- ACTIONNAME quantity
- ACTIONNAME quantity quantity
-
- assembly -> opcode operands options
- opcode -> OPCODENAME
- STRING <--- customised opcode
- operands -> operand operands descriptions are
- operand -> constant not parsed by the
- VARIABLENAME syntax analyser
- sp
- options -> ? branch
- -> store
- -> store ? branch
- store -> VARIABLENAME
- sp
- branch -> LABELNAME
- ~ LABELNAME
- =========================================================================
- indiv_state -> "box" strings ;
- "break" ;
- "continue" ;
- "do" eblock "until" ( condition ) ;
- "font" "on" ;
- "font" "off" ;
- "for" ( for1 : for2 : for3 ) block ;
- "give" quantity attributes ;
- "if" ( condition ) block
- "if" ( condition ) block "else" block
- "inversion" ;
- "jump" LABELNAME ;
- "move" quantity "to" quantity ;
- "new_line" ;
- "objectloop" ( ospec ) block ;
- "print" printlist ;
- "print_ret" printlist ;
- "quit" ;
- "read" quantity quantity ;
- "read" quantity quantity ROUTINENAME ;
- "remove" quantity ;
- "restore" LABELNAME ;
- "return" ;
- "return" quantity ;
- "rtrue" ;
- "rfalse" ;
- "save" LABELNAME ;
- "spaces" quantity ;
- "string" quantity STRING ;
- "style" textstyle ;
- "switch" ( quantity ) s_block
- "while" ( condition ) block
- for1 -> void_expression
- <empty>
- for2 -> condition
- <empty>
- for3 -> void_expression
- <empty>
- ospec -> VARIABLENAME
- VARIABLENAME "in" quantity
- VARIABLENAME "near" quantity
- VARIABLENAME "from" quantity
- vcondition
- printlist -> printitem , printlist
- printitem
- printitem -> STRING
- quantity
- ( form ) quantity
- form -> ROUTINENAME
- "number"
- "char"
- "address"
- "string"
- "The"
- "the"
- "a"
- "an"
- "name"
- "object"
- "identifier"
- textstyle -> "roman"
- "reverse"
- "bold"
- "underline"
- "fixed"
- =========================================================================
-
- A few critical comments on the above.
-
- From this grammar it's possible to work out N, the maximum look-ahead needed
- to distinguish which production is being used in the source code. The worst
- cases are:
-
- (a) distinguishing productions (1) and (2) in
-
- ospec -> VARIABLENAME
- VARIABLENAME "in" quantity (1)
- VARIABLENAME "near" quantity
- VARIABLENAME "from" quantity
- vcondition (2)
-
- which requires N = 5; these two productions need to be distinguished
- between because
-
- objectloop ( a in b )
- objectloop ( a in b ... )
-
- (the second containing a compound condition) compile quite different
- code: one loops through the siblings in the object tree, the other
- through the tree in numerical order.
-
- (b) distinguishing productions (1) and (2) in
-
- printitem -> STRING
- quantity (1)
- ( form ) quantity (2)
-
- i.e., between
-
- print (routine-name) expression
- print (expression which happens to begin with a bracket)
-
- which requires N = 3.
-
- The grammar contains one serious ambiguity: the innocent-looking production
-
- arrayvals -> constant arrayvals
-
- means that array initialisations list constants in sequence without commas
- or other separating characters. This makes it impossible to distinguish
- between unary and binary minus in a line like:
-
- Array X 2 - 1 ;
-
- The answer is "binary", since the grammar makes the longest match possible;
- but a "this is ambiguous" warning is issued.
-
- A further inconvenience in the grammar, though not much of an ambiguity,
- occurs in the initialisation part of "for" statements: there is a danger of
-
- for (p=Class::)
-
- being mis-parsed due to "::" being recognised as a binary operator (without
- a right operand, which would cause an error) and not as two consecutive
- ":" delimiters.
-
- If I were free to redesign the Inform grammar in the light of the last
- three years' experience (which I am loath to do, since so much Inform source
- code now exists), here are the changes I think I would make:
-
- introduce commas as separators between array values and <<Action ...>>
- parameters;
- remove the statements quit, restore, save, style, font, spaces, box,
- inversion and read: the first three ought to be used via the
- assembler anyway, and the last six ought to be system-provided
- functions;
- use single quotes to refer to dictionary words used as values of "name",
- thus removing an anomalous rule going back to Inform 1, and to
- refer to dictionary words in grammar;
- find some notation for literal characters which does not look like
- the notation for dictionary words: e.g., ''z'' rather than 'z';
- abolish the distinction between actions and fake actions;
- rename the Global directive "Variable";
- require an = sign to be used in "Constant X 24" directives.
-
-
- 5.4 Assigning values to symbols
- ---------------------------
-
- Assigning values to symbols is the main way by which the syntax analyser
- changes the way it will behave further on in the program. From a strict
- theoretical point of view one could insist the symbols table contains the
- only information remembered by the syntax analyser about the program so far,
- which otherwise churns out code which it forgets as soon as it has written:
- however, Inform's analyser also keeps a number of other data structures,
- such as the game dictionary and the object tree.
-
- When the lexer creates a new symbol (that is, when the table is searched
- for a string which isn't there, so that it is added to the table as a new
- entry), it has:
-
- value 256
- type CONSTANT_T
- flags only UNKNOWN_SFLAG
- line the current source line
-
- "Unknown" means that the syntax analyser doesn't recognise it as meaning
- anything yet. The flag will only be cleared when the syntax analyser
- assigns some value to the symbol (using the assign_symbol() routine), if ever.
-
- The line is reset to the source line then applying if the symbol is assigned
- a value, but is otherwise never changed.
-
- The type and value of a symbol are only altered via assign_symbol(). With
- one exception (see CHANGE_SFLAG below), the value once assigned is never
- subsequently altered by reassignment. Each symbol always has one of the
- following 11 types:
-
- Type Meaning Value
- ------------------------------------------------------------------------
- CONSTANT_T Defined constant or Value of constant
- value not known yet
- LABEL_T Name of label in source Label number
- GLOBAL_VARIABLE_T
- Name of global variable Z-machine variable number
- ARRAY_T Name of array Z-machine byte offset
- in dynamic variables area
- ROUTINE_T Name of routine Scaled offset in Z-code
- ATTRIBUTE_T Name of attribute Z-machine attribute number
- PROPERTY_T Name of common property Z-machine property number
- between 1 and 63
- INDIVIDUAL_PROPERTY_T
- Name of indiv property Property identifier >= 64
- OBJECT_T Name of object Z-machine object number - 1
- CLASS_T Name of class Z-machine object number - 1
- of its class-object
- FAKE_ACTION_T Name of fake action Action number >= 256
- ------------------------------------------------------------------------
-
- The full list of symbol flags, and their meanings, is as follows:
-
- ------------------------------------------------------------------------
- UNKNOWN_SFLAG no value has been assigned to this symbol
- USED_SFLAG the value of this has been used in an expression
- REPLACE_SFLAG the programmer has asked to Replace a routine
- with this name
- DEFCON_SFLAG this constant was defined by the "Default"
- directive
- STUB_SFLAG this routine was defined by the "Stub" directive
- (similarly)
- INSF_SFLAG this symbol was originally assigned a value in a
- system_file
- SYSTEM_SFLAG this symbol was assigned a value by Inform itself,
- as one of the standard stock (such as "true"
- or "recreate") provided to all programs
- UERROR_SFLAG a "No such constant as this" error has already been
- issued for this name
- ALIASED_SFLAG this is an attribute or property name whose value
- is shared with another attribute or property name
- CHANGE_SFLAG this is a defined Constant whose value needs later
- backpatching, or is a Label whose label number
- has been allocated before its declaration in the
- code
- ACTION_SFLAG this is a ## name (of an action or a fake action)
- REDEFINABLE_SFLAG the symbol can be defined more than once using the
- "Constant" directive, without errors being
- produced. (Used for the special constants DEBUG,
- USE_MODULES, MODULE_MODE and Grammar__Version.)
- ------------------------------------------------------------------------
- IMPORT_SFLAG this name is "imported" (used in module compilation)
- EXPORT_SFLAG this name was "exported" from some module
- (used in story file compilation)
- ------------------------------------------------------------------------
-
- USED_SFLAG is used only to decide whether or not to issue "declared but not
- used" warnings. A symbol has been "used" if one of the following is true:
-
- (i) its value occurred in an expression;
- (ii) it has been assigned, and tested with IFDEF or IFNDEF;
- (iii) it's an action routine name used in grammar or via ##
- (e.g. use of ##Take causes TakeSub to be "used");
- (iv) it's a fake action name used via ##;
- (v) it's a property, attribute or class name used in an
- object or class definition;
- (vi) it's a label name branched to in assembly language or
- the destination of a "jump" statement;
- (vii) it's the routine name "Main";
- (viii) it's a routine name referred to by the obsolete #r$
- construct;
- (ix) it's a routine name of a veneer routine whose definition
- is being over-ridden in the source code, and to which a
- call has been compiled;
- (x) it's a routine name used in a grammar token;
- (xi) it's referred to in a module which has been linked into
- the story file being compiled.
-
- Note that such warnings are not issued for object names, since in much
- Inform 5 code objects are given irrelevant object symbol-names (the syntax
- required it); nor for symbols defined by Inform, or in the veneer, or in
- a system file. Warnings are never issued (except for labels and local
- variables, which have local scope) when compiling modules, since there is
- no way of knowing at compile time which exported symbols will be used and
- which will not.
-
- The warnings are issued at the end of the source code pass, but before the
- veneer is compiled. The slines[] array is used to work out which source
- lines to report from.
-
- CHANGE_SFLAG is used when definitions such as:
-
- Constant frog_word = 'frog';
-
- are reached, where the value (the dictionary address for 'frog') cannot
- immediately be known. Such symbol values are backpatched later as needed.
-
-
- All symbols in the table have "global scope" (that is, their definitions are
- valid everywhere in the source program) except for label names, whose scope
- is restricted to the current routine. Thus, the same label name can be used
- in many different routines, referring to a different label in each.
-
- To achieve this, the routine-end routine in "asm.c" cancels the assignment
- of label names, returning them to type CONSTANT_T and flag UNKNOWN_SFLAG.
-
- Local variables also have local scope, but for efficiency reasons these are
- not stored in the symbols table but in a special hash table in level 3 of
- the lexer.
-
-
- Note that action names have a different "name-space" from other symbols:
- this is why the library's action name "Score" is never confused with its
- variable "score". Rather than use a different symbols table altogether,
- actions are stored as integer constants in the form
-
- Score__A
-
- (thus the value of this symbol is the value ##Score). Similarly, fake
- actions are stored this way (but with type FAKE_ACTION_T rather than
- INTEGER_CONSTANT_T); both forms of symbol are flagged ACTION_SFLAG.
-
-
- 5.5 Outputs other than assembly language
- ------------------------------------
-
- Below the parse_routine() part of the syntax analyser, the only output is
- assembly language (together with dictionary words and encoded text as needed
- within it).
-
- However, parse_directive() has several other outputs, as shown on the source
- code map (section 1.2 above). Directives create objects to be remembered and
- written into the final program: arrays, verb definitions, actions, objects
- and so on. These will be called "program elements".
-
- Directives also "create" constants, attributes, properties and fake actions,
- but in these cases creation only consists of assigning a suitable value to a
- symbol. So these do not count as program elements.
-
- The data structures to hold "program elements" are all created and maintained
- within "directs.c" and its subsidiaries (such as "objects.c", the
- object-maker); they are then translated into a usable Z-machine form in
- "tables.c". (With only trivial exceptions, the data structures are not
- accessed anywhere else in Inform.)
-
- ---------------------------------------------------------------
- Program element Section Name of (main) data structure
- ---------------------------------------------------------------
- Global variable arrays.c global_initial_value[]
- Array arrays.c dynamic_array_area[]
- Release number directs.c release_number
- Serial code directs.c serial_code_buffer[]
- Statusline flag directs.c statusline_flag
- Object tree objects.c objects[]
- Common property objects.c prop_default_value[]
- default value
- Class-to-object- objects.c class_object_numbers[]
- numbers table
- Common property objects.c *properties_table
- values for
- objects
- Individual prop objects.c *individuals_table
- values for
- objects
- Table of static symbols.c individual_name_strings[]
- string values
- for property
- names
- And attribute names symbols.c attribute_name_strings[]
- And action names symbols.c action_name_strings[]
- Abbreviation text.c abbreviations_at[]
- table entry
- Grammar table verbs.c Inform_verbs[]
- Token-routine verbs.c grammar_token_routine[]
- addresses
- List of dict verbs.c adjectives[]
- addresses for
- "adjective" words
- Action routine verbs.c action_byte_offset[]
- addresses
- ---------------------------------------------------------------
-
- A rather unequal gathering: the storage required to hold the common property
- values table may be as much as 32K, whereas the statusline flag can be held
- in 1 bit.
-
- There are three other program elements: Z-code, kept by "asm.c"; the static
- strings of Z-encoded text, kept by "text.c"; and the dictionary, also kept by
- "text.c".
-
-
- 5.6 Assembly operands
- -----------------
-
- The type "assembly_operand" is used for all numerical values (and local or
- global variable references) which are destined one day to be written into the
- Z-machine. Many of these will indeed be operands in Z-code instructions,
- hence the name. Others will be property values, array entries and so on.
-
- The definition is:
-
- { int type;
- int32 value;
- int marker;
- }
-
- which is a pretty generous memory allocation for holding a signed 16-bit
- number. (However, there are no large arrays of this type; type and marker
- could easily each have type char, but I suspect this would be slower because
- of alignment problems on some compilers; and speed does matter here.)
-
- The possible types are:
-
- SHORT_CONSTANT_OT number with value between 0 and 255
- LONG_CONSTANT_OT number with any 16 bit value
- VARIABLE_OT reference to the stack pointer if value is 0
- local variable if value 1 to 15
- global variable if value 16 to 255
-
- In addition, two special types are in use by the expression parser:
-
- OMITTED_OT (the operand holds no information)
- EXPRESSION_OT reference to a parse tree
-
- The "marker" value is used to record the origin of the data, which is
- essential to make backpatching work. For example, in the line
-
- v = "Jackdaws love my big sphinx of quartz";
-
- the right operand is marked with STRING_MV, because the value which needs
- to be stored in the Z-machine is a packed string address and this cannot be
- known until after the compilation pass (when the size of the tables and
- the code area are known). Wherever the operand is written, in code or in
- a table, "bpatch.c" will later find and correct it.
-
-
- 5.7 Translation to assembly language
- --------------------------------
-
- The Z-machine has a very easy architecture to generate code for. Most
- compilers' syntax analysers generate a "three-address code" as an
- intermediate stage towards final object code; this is a sequence of
- instructions in the form
-
- x = y <operator> z
-
- together with conditional branches and labelled points to branch to. (See
- ASU, p.466.) Translating this code into assembly language for some CPU
- is a further compilation phase: the tricky part is not translating the
- operators into instructions, but deciding where to locate the values
- x, y and z. On most CPUs, a limited number of registers can hold values, and
- arithmetic operations can only be performed on these registers; moreover,
- holding data in a register is much more efficient than holding it elsewhere.
- The problem is therefore to allocate registers to quantities, and the
- performance of the compiled code depends very heavily on how well this is
- done. (Register allocation is far from being a solved problem in the way
- that lexical analysis, say, is considered to be.)
-
- What makes the Z-machine particularly easy is that its instruction set is
- more or less three-address code already. Arithmetic can be performed with
- constants as operands as well as with "registers"; and not only is every
- local or global variable automatically allocated to a register of its own,
- but a stack is available to hold intermediate values (and with no performance
- penalty for its use, since it is accessible as though it were a register).
-
- The key point to remember when looking at Z-code is that writing a value to
- the "stack-pointer variable" pushes it onto the Z-machine stack; using the
- "stack-pointer variable" as the operand for an instruction pulls the value
- being read off the stack.
-
-
- Despite the availability of the stack, it's still convenient to have a few
- spare variables to use as "registers" holding temporary values. Inform
- reserves the variables 249, 250, 251, ..., 255 for its own use: though this
- slightly exaggerates the position, since two of these are used for the
- variables "self" and "sender". One more is used to hold the switch-value
- of the current switch statement (if there is one); the remaining four in
- order to implement various system functions (such as "children") with
- inline-code rather than routine calls to the veneer. (The one inconvenience
- with the Z-machine's instruction set is that there's no good way to read the
- top of the stack non-destructively, or to duplicate the top value, or to
- reorder the top few values.)
-
-
- The syntax analyser produces code by making function calls to the assembler.
- There are four types of function call:
-
- assemble_routine_header(...)
- assemble_routine_end(...)
-
- at the start and end of every routine;
-
- assemble_label_no(N)
-
- to indicate that the Nth label belongs here (i.e., at the point where the
- next instruction will be put); and then a fair variety of function calls
- to generate actual instructions. For example,
-
- assemble_jump(N)
-
- assembles an unconditional jump to label N. A typical "three-address code"
- is assembled by a routine like
-
- assemble_2_to(mul_zc, AO1, AO2, stack_pointer)
-
- meaning "the instruction mul_zc, which has 2 operands and has a result which
- it writes to the variable indicated by the third operand". AO1 and AO2
- are assembly_operands (the abbreviation is often used in the source), and
- so is stack_pointer (it has type VARIABLE_OT and value 0).
-
-
- A typical example of how the top-down syntax analyser generates code is
- given by the code for the "while" statement:
-
- case WHILE_CODE:
- assemble_label_no(ln = next_label++);
- match_open_bracket();
-
- code_generate(parse_expression(CONDITION_CONTEXT),
- CONDITION_CONTEXT, ln2 = next_label++);
- match_close_bracket();
-
- parse_code_block(ln2, ln, 0);
- assemble_jump(ln);
- assemble_label_no(ln2);
- return;
-
- Note that this expects to match
-
- while ( condition ) statement
- or { statements }
-
- The expression parser is called to turn the condition into a parse tree,
- and its output is fed straight into the code generator for parse trees.
-
- The label numbers ln2 and ln are supplied to the routine for parsing code
- blocks because they indicate which labels the statements "break" and
- "continue" should generate jumps to.
-
- For example,
-
- while (i <= 10) print i++;
-
- generates the assembly language
-
- .L0; @jg i 10 ?L1;
- @inc i;
- @print_num i;
- @jump L0;
- .L1;
-
- In terms of function calls to "asm.c":
-
- assemble_label_no(0);
- assemble_2_branch(jg_zc, AO_for_i, AO_for_10, 1, TRUE);
- assemble_inc(AO_for_i);
- assemble_1(print_num_zc, AO_for_i);
- assemble_jump(0);
- assemble_label_no(1);
-
- (Note that incrementing is needed so often that assemble_inc() is provided
- as a shorthand: actually also because of a form of indirect addressing used
- by the Z-machine for store_zc and inc_zc, dec_zc. assemble_store() and
- assemble_dec() are also provided.)
-
-
- 5.8 Summary of assembly language instructions output
- ------------------------------------------------
-
- The list of Z-machine instructions which Inform actually generates code for
- by itself is quite short (about half of the 115 possible instructions are
- ever used). The assembly-language "@ statement" parser can, of course,
- generate every Z-machine instruction.
-
- The instructions used for three-address-style code (that is, for program
- control, arithmetic, etc.) are as follows. For function calls and returns,
- unconditional jumps and for moving values between variables, memory locations
- and the stack:
-
- call_* rfalse rtrue ret_popped
- ret inc dec store
- push pull storeb storew
- loadb loadw jump
-
- (There are various forms of call instruction to do with the number of
- operands, whether a return value is wanted or not, etc.). Six conditional
- branch instructions are used:
-
- je jz jl jg
- jin test_attr check_no_args
-
- (the first four are numerical branches, the next two related to the object
- tree, and the last one is used only in V5 or later games, and then only
- for printing out tracing code): note that each can also be used in a
- "negate this condition" form. Finally, signed 16-bit arithmetic and
- unsigned 16-bit bitwise operations:
-
- add sub mul div
- mod and or not
-
-
- A further batch of instructions is used, so to speak, in lieu of calls to a
- standard library of input/output and utility routines (like C's "stdio"):
-
- print print_ret print_char print_num
- print_addr print_paddr print_obj new_line
- insert_obj remove_obj get_parent get_child
- get_sibling get_prop get_prop_addr get_prop_len
- put_prop random aread/sread quit
- save restore output_stream
-
- with five more exotic screen control instructions used only if a "box"
- statement is ever compiled:
-
- split_window set_window style set_cursor
- buffer_mode
-
-
- 6 Syntax analysis 2: the bottom-up expression parser
- --------------------------------------------------
-
- 6.1 Map and structure
- -----------------
-
- It would be possible to continue the top-down parser down into the level of
- expressions: for instance, to have a routine parse_plus() which would
- parse_expression(), then match a plus sign, then parse_expression() again,
- and so on. This would however be highly inefficient in terms of function
- calls, difficult to recover from when an error occurs, and require either
- much greater lookahead or rather complicated mechanisms for storing parse
- trees.
-
- Expressions (including assignments, conditions and constant values) are
- therefore parsed by a different syntax analyser, which is "bottom up" (it
- begins by reading in the tokens, and only works out what they mean afterward)
- rather than "top down" (starting from an assumption about the meaning and
- then trying to find the evidence to justify it).
-
- The algorithm used is an operator precedence parser, a simple form of
- shift-reduce parser. There is alas no space to describe this beautiful
- algorithm here: see ASU, pp.195-215. In this account we treat it as a
- "black box".
-
- The expression parser "expressp.c" has the structure:
-
- --------> "Level 4" of --------> Operator
- tokens lexical analysis etokens precedence
- in parser
- |
- | etokens re-ordered into RPN
- \|/
- Emitter
- |
- | plain parse tree
- \|/
- Lvalue and
- condition checking
- |
- | more detailed parse tree
- \|/ out
-
- (This is effectively a magnification of the part of the source map in
- section 1.2 above which reads just "expressp.c".) Note that there is a
- single input and a single output channel.
-
- RPN is "reverse Polish notation": for example
-
- (2 + 4)*45 + 34 becomes 2 4 + 45 * 34 +
-
- the idea being that one reads it left to right: each number is added to
- a stack of numbers, and each operation takes some numbers off the stack
- and puts an answer back. (E.g., + takes 2 and 4 off and replaces them
- with 6. * takes 6 and 45 off, replacing them with 270. + takes 34 and 270
- off, replacing them with 304, which is the answer.) The advantage of
- this is that is unambiguous which operator acts on which operands, and the
- brackets have gone.
-
- The emitter builds a (plain) parse tree out of this sequence, as follows.
- It stacks 2 and then 4:
-
- Stack: 2 4
-
- It reads +, which it knows has two operands, and takes the top two items
- off the stack, joining them into a tree segment like so:
-
- +
- / \
- 2 4
-
- It then makes a special value, say E1, meaning "the sub-expression in this
- tree", and stacks that:
-
- Stack: E1
-
- It then stacks up 45, and reaches *: it takes off the top two values, which
- are E1 and 45, and makes
-
- *
- / \
- E1 45
-
- which it calls E2, and so on. When the process is finished, we have the
- tree:
- + <--- this is E3
- / \
- this is E2 ---> * 34
- / \
- this is E1 ---> + 45
- / \
- 2 4
-
- and the stack contains just E3, which is the "answer".
-
-
- 6.2 The operator precedence grammar
- -------------------------------
-
- A list of productions for this grammar would be tiresomely long. It can
- be roughly summarised as:
-
- expression -> ( expression )
- a single operator acting on some operands in some way
- operand -> token representing a constant
- ( expression )
- expression ( arguments )
- arguments -> <empty>
- expression , arguments
-
- The tokens representing a constant are given in the next section. The
- operators have tokens as follows:
-
- --------------------------------------------------------------------------
- Level Lexeme Usage Associativity Purpose
- --------------------------------------------------------------------------
- 0 , binary left separating values to work out
-
- 1 = binary right set equal to
-
- 2 && binary left logical AND
- 2 || binary left logical OR
- 2 ~~ unary (prefix) logical NOT
-
- 3 == binary none equal to?
- 3 ~= binary none not equal to?
- 3 > binary none greater than?
- 3 >= binary none greater than or equal to?
- 3 < binary none less than?
- 3 <= binary none less than or equal to?
- 3 has binary none object has this attribute?
- 3 hasnt binary none object hasn't this attribute?
- 3 in binary none first obj a child of second?
- 3 notin binary none first obj not a child of second?
- 3 ofclass binary none obj inherits from class?
- 3 provides binary none obj provides this property?
-
- 4 or binary left separating alternative values
-
- 5 + binary left 16-bit signed addition
- 5 - binary left 16-bit signed subtraction
-
- 6 * binary left 16-bit signed multiplication
- 6 / binary left 16-bit signed integer division
- 6 % binary left 16-bit signed remainder
- 6 & binary left bitwise AND
- 6 | binary left bitwise OR
- 6 ~ unary (prefix) bitwise NOT
-
- 7 -> binary left byte array entry
- 7 --> binary left word array entry
-
- 8 - unary (prefix) 16-bit (signed!) negation
-
- 9 ++ unary (pre/postfix) read/increment or increment/read
- 9 -- unary (pre/postfix) read/decrement or decrement/read
-
- 10 .& binary left property address
- 10 .# binary left property length
- 10 ..& (**) binary left individual property address
- 10 ..# (**) binary left individual property length
-
- 11/14 ( ) binary left function call/message send
-
- 12 . binary left property value
- 12 .. (**) binary left individual property value
-
- 13 :: binary left "superclass" operator
- --------------------------------------------------------------------------
- (**): Illegal except internally, in Inform's own source for the run-time
- veneer
- --------------------------------------------------------------------------
-
- Note that, unlike C, Inform does not provide unary plus.
-
- A binary infix operator is used in the form "X op Y"; a unary prefix operator
- in the form "op X"; a unary postfix operator in the form "X op".
-
- The "level" is the precedence level: the higher this number, the more tightly
- an operator is glued to the operands adjacent to it.
-
- Cases where two binary infix operators have equal precedence are resolved
- according to whether that level is left or right associative. For instance,
-
- a / b / c means (a/b)/c
-
- since /, on level 6, is left associative. But
-
- a = b = c means a = (b = c)
-
- since =, on level 1, is right associative.
-
- Cases in the above table where the associativity is "none" mean that an
- attempt to silently associate the operator will cause an error. For example,
-
- a == b == c
-
- generates an error as being ambiguous, but
-
- (a == b) == c and a == (b == c)
-
- are both permitted.
-
- The function-call-brackets have an asymmetric precedence level according
- to whether they're being compared with something on the left or on the right:
- the point is that
-
- object.message(...)
- function_returning_an_object(...).property
-
- must be recognised as
-
- (object.message)(...) . 12 > (...) 11
- (function_returning_an_object(...)).property (...) 14 > . 12
-
- respectively.
-
-
- 6.3 Lexical level 4: tokens to etokens
- ----------------------------------
-
- The shift-reduce parser expects tokens sorted out into operators, such as
- "+" or "ofclass", operands such as "34" or "score" (assuming that's a
- variable) and three tokens of special significance:
-
- ( ) end-of-expression
-
- "Level 4 of the lexer" does just this. It converts the stream output from
- "lexer.c" (i.e., from level 3) into a stream of tokens from the following
- table (and it uses a good deal of context information to do so).
-
- Token type Meaning
- -------------------------------------------------------------------------
- OP_TT Operator: value is a *_OP constant
-
- LARGE_NUMBER_TT Operand (number): value is the number
- SMALL_NUMBER_TT Operand (number guaranteed in the range 0 to 255):
- value is the number
- VARIABLE_TT Operand: value is Z-machine variable number
- DQ_TT Operand (text used as static string): text in text
- DICTWORD_TT Operand (text used as dictionary word): text in text
- ACTION_TT Operand (##action number): name of action in text
- SYSFUN_TT Operand (system fn name): value is a *_SYSF constant
- SYSTEM_CONSTANT_TT Operand (#system_constant): value is a *_SC constant
-
- SUBOPEN_TT Open bracket used to open a sub-expression
- SUBCLOSE_TT Close bracket used to close a sub-expression
-
- ENDEXP_TT End of expression
- -------------------------------------------------------------------------
-
- The tokens in this output stream are called "etokens", short for "expression
- tokens": they are the raw material from which parse trees are built up.
- (Although formally different from ordinary tokens, they have the same storage
- type, "token_data", in the Inform source code.)
-
- At first sight this seems an unnecessarily large stock: why not, for example,
- compile static strings and dictionary words and convert them to their
- addresses here and now? Why not evaluate system constants and action numbers
- now? Because: (i) these tokens may not be accepted by the parser, so we
- can't compile text on the strength of them (and indeed, the parse tree which
- is being parsed may not ever be code-generated from); and (ii) we can't
- allow ourselves to lose sight of where numbers are coming from so soon,
- or backpatching will be impossible.
-
-
- The presence of two different tokens for numbers - one for numbers guaranteed
- to be unsigned 8 bit, one for numbers with no such guarantee and whose value
- may be any signed 16 bit quantity - is due to the architecture of the target
- Z-machine. In Z-code considerable space savings are made by storing small
- numbers in 1, rather than 2, bytes. We are anxious to take advantage of
- this.
-
- Unfortunately, what seems to be a small number now may not be a small number
- later (after backpatching). Dictionary word values (such as 'word') are
- internally stored as the accession number of that word into the dictionary
- (say 23, if it is the 23rd different word to be entered) during the
- compilation pass, and then backpatched later to the byte address of this
- word's dictionary entry: but this will be a large number.
-
- Most operands with non-NULL backpatch markers are forced into long form, to be
- on the safe side. (This includes all operands which are forward references
- to constants not yet defined.) The exceptions are that variable and action
- numbers (though not fake action numbers) are guaranteed always to fit into
- short form.
-
-
- There are two tasks involved in translating tokens to etokens: evaluating
- tokens representing quantities into one of the "operand" etokens, and sorting
- other tokens into the right operator and special etokens. The second process
- is not as easy as it looks because of ambiguities, and is discussed in the
- next section. This section discusses the evaluation of quantity tokens.
-
- Tokens in expressions are read in a context where the keywords are local
- variables, textual operators, system function names and system constant
- names. Other identifiers are symbol names. So the token types which must
- represent quantities are:
-
- DQ_TT, SQ_TT, LOCAL_VARIABLE_TT, NUMBER_TT,
- SYMBOL_TT, SYSFUN_TT and SYSTEM_CONSTANT_TT.
-
- (Actually, this omits a detail: the largely obsolete constant forms #a$...,
- #r$..., #n$... are SEP_TT tokens converted into the above; and the very much
- still with us constant form ##... is a SEP_TT token converted
- straightforwardly into the ACTION_TT etoken.)
-
- These all translate directly into etokens except for SYMBOL_TT and SQ_TT.
- SQ_TT requires a little work because of the Inform syntax for single-quoted
- strings:
-
- 'x' means the character code (i.e. ASCII value) for "x"
- '@ss' means the character code for the German letter "sz"
- (i.e. the code specified in the accented set in the
- Z-Machine standard document)
- 'xyzzy' means the byte address of the dictionary word "xyzzy"
-
- The first two become SMALL_NUMBER_TT etokens. The third becomes a
- DICTWORD_TT etoken.
-
-
- This leaves SYMBOL_TT tokens: names for constants. There are two
- possibilities: the symbol is so far unknown, or else it has been assigned
- a value in earlier syntax analysis.
-
- In most languages it would be an error to use an unknown symbol name as
- a value: in C, for example, a function name cannot be used in an expression
- unless it has been declared beforehand.
-
- Inform is more tolerant: as far as expressions are concerned, any symbol can
- be used earlier than the point in the source code where its value is
- assigned, excepting only a local or global variable name. (This exception
- is mainly for Z-machine architecture reasons, but it also makes Inform code
- more legible.)
-
- When an unknown symbol is reached, it is converted to a LARGE_NUMBER_TT and
- marked as SYMBOL_MV, with the value being its index in the symbol table.
- (This allows the backpatcher to know where to find the true value later.)
-
- When a known symbol is reached which is flagged as CHANGE_SFLAG, this is
- treated in the same way. (CHANGE_SFLAG is given to symbols defined by
- Constant definitions whose values themselves require backpatching: for
- instance, symbols defined equal to dictionary words or routine addresses.)
-
- Otherwise, the symbol is not only known, but its current value is known to
- be correct, and so:
-
- if it's a variable, it's translated to etoken VARIABLE_TT;
- otherwise the LARGE_NUMBER_TT/SMALL_NUMBER_TT decision is now made:
- if it is marked with a marker other than VARIABLE_MV, it becomes
- "long";
- otherwise the value is compared with the range 0 to 255.
-
- Any symbol name fed into the shift-reduce parser is flagged with USED_SFLAG
- to indicate that its value has at some point been used in an expression.
- (This information enables "This ... was declared but not used" warnings to be
- made at the end of compilation.)
-
-
- 6.4 Resolution of ambiguous tokens
- ------------------------------
-
- It remains to translate to the etokens OP_TT, SUBOPEN_TT, SUBCLOSE_TT and
- ENDEXP_TT. The token types which represent operators are all from SEP_TT and
- COND_TT; open and close bracket are also SEP_TT tokens.
-
- There are two main difficulties here. Firstly, shift-reduce parsers are
- unsuitable for resolving ambiguities, and so it is not possible to map tokens
- directly into operators. The ambiguities in the expression grammar for
- Inform are:
-
- (a) ++ and -- each refer to either a prefix or a postfix operator
- (b) - refers to either a unary prefix operator or a binary infix one
- (c) ( and ) are used both to indicate function calls and subexpressions
- (d) , is used both to indicate a sequence of values, each to be evaluated
- and thrown away with only the last one kept, and to separate
- function arguments
- (e) -> does not represent an operator when parsing constants as operand
- values in a line of "@" assembly language (because it needs to
- mean "the store value goes in")
-
- These are resolved as follows:
-
- (a), (b) The "level 4 lexer" watches for context: for instance, MINUS_SEP
- translates to UNARY_MINUS_OP if there was no previous token, or
- it was an open bracket, or an infix or prefix operator, or a
- comma; otherwise it translates to MINUS_OP.
-
- (c) Similarly. If an open bracket token is read, and the previous
- token was an operand of some kind, then it must represent a
- function call (unless we are in constant context); an extra
- etoken is generated for an imaginary infix operator called
- FCALL_OP. The open bracket token is then translated to
- SUBOPEN_TT as usual; a close bracket is always SUBCLOSE_TT. Thus,
- the stream
-
- Eroica ( Beethoven , 3 )
-
- is translated into the etokens
-
- Eroica <fcall> ( Beethoven <comma> 3 )
-
- so that a tree will come out of the emitter looking like
-
- <fcall>
- / \
- Eroica <comma>
- / \
- Beethoven 3
-
- although in fact the emitter recognises what is going on and
- automatically simplifies this to:
-
- <fcall>
- / | \
- / | \
- Eroica Beethoven 3
-
- (d) A comma at the top level is either disallowed (in constant context)
- or treated as the operator <evaluate from left to right but throw
- away the result on the left and keep the one on the right>. (Just
- as in C.) Otherwise, comma is a separator of arguments. "Top
- level" means "top bracket level", which the "level 4 lexer" keeps
- track of by counting the bracket tokens going through it.
-
- (e) There is a special expression context called "assembly language",
- in which "->" is translated to ENDEXP_TT.
-
-
- The second main difficulty is entirely of the language designer's own making.
- How is the end of an expression to be detected? In C, it always possible
- when parsing an expression to say "the end will have been reached when
- such-and-such a token is reached": for example, ";" or "}". This is not
- possible in Inform.
-
- Even if one knows in advance what the terminating token ought to be, it might
- be impossible to recognise for context reasons. For example, the statement
-
- move <first expression> to <second expression>;
-
- appears to be no problem: the first expression is terminated by the keyword
- "to", the second by ";". But suppose "to" has another meaning, as the name
- of a constant or variable? In this case,
-
- move 2 + to to to;
-
- is a legal statement.
-
- For all these reasons, Inform actually detects the end of an expression at
- lexical level 4 by watching the context carefully: for instance, in
-
- move 4 * banana to ...
-
- when "4 * banana" has been passed, the next token is expected to be an open
- bracket, an infix or postfix operator, or (perhaps) a comma. Since the next
- token is none of these, the expression has ended. Note that this decision
- can be taken without knowing whether "to" is a keyword or a symbol.
-
-
- From a grammatical point of view the worst design flaw in the Inform language
- is that array initialisations (either in Array directives or object property
- values) consist of sequences of expressions given without any separator in
- between. For example,
-
- Array X --> 2 4 6 * MAX_FROGS 45 ;
-
- This is arguably a convenient shorthand. Unfortunately, it falls prey to
- that predator of all computing grammars, unary minus. The directive
-
- Array X --> 10 -5;
-
- is ambiguous. Is it an array of 10-5 = 5 entries, or an array of two entries,
- 10 and -5?
-
- In Inform 5 there was no problem since constants were not allowed to contain
- operators anyway (and negative numbers were represented by single tokens,
- which is not true in Inform 6): the directive can only mean that there are
- two entries, 10 and -5.
-
- In Inform 6 the other answer is true, because the parser always makes the
- longest match it can. Since this may mean that some Inform 5 code needs
- changing in a few places, a warning that "this is ambiguous" is issued
- for unbracketed minus signs used in array initialisations.
-
- (Illustrating the problem further, if the user tries to clarify things with
-
- Array A --> 10 (-5);
-
- then this is parsed under the normal rules as the function call 10(-5);
- therefore another constant-context rule is used to interpret brackets as
- always referring to subexpressions, not function calls.)
-
-
- Similar difficulties at first seem to occur with < and >, and with << and >>.
- Does
-
- <Action 3 -4>
-
- refer to the pair (Action, -1) or the triple (Action, 3, -4)? (It is now
- obvious that the correct syntax design would be to insist on a comma in
- between 3 and -4.) Fortunately, though, both Informs 5 and 6 apply a "make
- the longest expression possible" principle to this case, so they agree that
- the answer is (Action, -1). On the same grounds, the statement
-
- <Action x ++ y>
-
- is understood as applying ++ to x, not to y, because the "longest expression
- possible" when parsing the first expression is "x ++".
-
-
- 6.5 Constant folding
- ----------------
-
- Recall that the emitter takes as input a stream of etokens from the
- shift-reduce parser, consisting of the expression re-ordered into RPN. When
- the whole stream has been received (i.e., when it has received ENDEXP_TT),
- the emitter has to produce an operand as "answer".
-
- For instance, if the emitter receives just the operand 56, then the answer
- is 56. More often, the answer may be a special value (written above as E1,
- E2, ...) meaning "the numerical value has to be worked out by working through
- this tree". (E1 is actually a new type of etoken, of type TREE_NODE_TT.)
-
- If this were done strictly, then the original text 4 + 7 would result in the
- emitter producing the answer E1, where E1 refers to the expression tree
-
- +
- / \
- 4 7
-
- and this is not a helpful answer, for two reasons: it means the value
- couldn't be used as a constant (since it seems to need Z-code to compiled to
- work it out), and even if it's used in code, there's not much point in
- compiling the Z-code instruction
-
- @add 4 7 -> sp;
-
- just to put the value 11 on the stack.
-
- Therefore, if the emitter finds an operator that it knows how to calculate
- (such as +), whose operands are all constants (and which are not unknown
- for linking reasons), it eliminates the sub-expression in favour of the
- result. This is called "constant folding": the constant value 4+7 is
- folded up into 11.
-
- Two issues arise: firstly, arithmetic must be done exactly as it would be
- done in the Z-machine, that is, as signed 16-bit arithmetic. (Otherwise
- constants may fold differently in one port of Inform or another.)
-
- Secondly, constant folding cannot safely be performed on a marked value.
- (For otherwise: consider what might happen to the expression SODS - LAW,
- at a time when neither SODS nor LAW have been defined. Both are evaluated
- as constants with value their symbol numbers and marker SYMBOL_MV: the result
- is folded to the difference between their symbol numbers, and the marker
- is lost: i.e., the result is nonsense.)
-
- The emitter is able to fold the following operators:
-
- + - (unary or binary) * / % & | ~
- == ~= > < >= <= && || ~~
-
- The inclusion of the conditionals here is quite important: it enables
- directives like
-
- Iftrue #version_number == 3;
-
- to work, since the expression "#version_number == 3" can be parsed in a
- constant context.
-
- The inclusion of unary minus is also important, as it is only by folding
- this that the text "-45" can be parsed in constant context.
-
-
- 6.6 Checking lvalues and simplifying the parse tree
- -----------------------------------------------
-
- If the emitter produces a parse tree, then it's now modified according to
- several rules. The aim is to remove any ambiguities in the tree (for
- instance, "=" does several different things according to what its left
- operand is) and make a final correctness check (for instance, this is where
- "45 = 12" results in an error message).
-
-
- Firstly, the tree is traversed to find instances where a value is used where
- a condition was expected. (If the expression is in a condition context, the
- top node is expected to be a condition; the children of a logical operator
- &&, || or ~~ are expected to be conditions.) The configuration
-
- <value or tree>
-
- is replaced by
-
- <~= 0 condition operator>
- |
- <value of tree>
-
- If the <value or tree> has root node the "=" operator, a warning is printed
- out, since it seems likely that
-
- if (x = 4) ...
-
- (which would invariably be found true, since the value of "x=4" is 4 which is
- non-zero) is a mistype for
-
- if (x == 4) ...
-
-
- A second, depth-first, traversal is then made to remove usages of ~~, using
- de Morgan's laws:
-
- ~~(x && y) = ~~x || ~~y
- ~~(x || y) = ~~x && ~~y
- ~~(x == y) = x ~= y
- ~~(x >= y) = x < y
-
- and so on. (Note that, internally, every operator etoken has a corresponding
- negated operator etoken.)
-
-
- One ambiguity remaining in the tree is that the operators ".", ".#" and ".&"
- act on both common and individual properties (that is, properties handled by
- the Z-machine's architecture directly, and properties handled by the run-time
- veneer routines). The decision about which is intended is made here, by
- looking at the right operands of the operators. (If it's a common property
- number, or an individual property number, the decision is obvious; if it's a
- variable or some compound expression, then we decide in favour of individual
- properties (the run-time veneer routines can patch things up if this is a
- wrong decision).)
-
- So a traversal is made to carry out the following transformation:
-
- . <common .> <indiv .>
- / \ becomes / \ or / \
- a b a b a b
-
- (Since the veneer routines are themselves written in Inform code, a slightly
- different translation rule applies to them: "." always means <common .>
- and so on, while three "secret" operators are provided, "..", "..&" and
- "..#", to mean <indiv .> and so on. Outside of the veneer, use of these
- operators is illegal.)
-
- In the same traversal, function calls are distinguished from sent messages:
-
- <fcall> <send message>
- / \ \ ... / | | \ ...
- <common .> c d becomes a b c d
- or <indiv .>
- / \
- a b
-
-
- The final traversal performs "lvalue checking". Five operators act directly
- on storage objects (such as variables) rather than values: these are
-
- <lvalue> = <ordinary value>
- <lvalue> ++
- <lvalue> --
- ++ <lvalue>
- -- <lvalue>
-
- where <lvalue> has to be a reference to a storage object. There are up to
- five kinds of storage object in the tree:
-
- local/global variable VARIABLE_TT
- common property value of an object <common .> subexpression
- individual property value of an object <indiv .> subexpression
- entry in byte -> array -> subexpression
- entry in word --> array --> subexpression
-
- This makes 25 combinations. As well as checking that what ought to be an
- lvalue is indeed one of these five possibilities, the traversal also
- rewrites the tree explicitly with one of 25 operators. For instance:
-
- = <set byte array entry =>
- / \ / | \
- -> v becomes a b v
- / \
- a b
-
- and so on.
-
- Thus, the end of this process is a tree representing a syntactically correct
- expression (if necessary having mended any errors in the source code's
- description of the expression), which has a minimal number of logical
- operators and no negation operators, and in which the action to be taken by
- any node does not depend on what its children are.
-
-
- 6.7 Summary of parse tree output
- ----------------------------
-
- It is time to formally specify the input and output of the main routine in
- "expressp.c":
-
- assembly_operand parse_expression(int context)
-
- takes, as argument, one of the seven *_CONTEXT constants:
-
- QUANTITY_CONTEXT the default: the result may be any quantity
-
- VOID_CONTEXT the expression is used as a statement, so that
- its value will be thrown away and it only
- needs to exist for any resulting side-effects
- (used in function calls and assignments)
-
- CONDITION_CONTEXT the result must be a condition
-
- CONSTANT_CONTEXT the result must be known now (at compile time)
-
- ASSEMBLY_CONTEXT like QUANTITY_CONTEXT, but with the '->'
- operator forbidden
- (this is needed for assembly language to
- indicate store destinations)
-
- FORINIT_CONTEXT like VOID_CONTEXT, but with the '::' operator
- forbidden at the top level
- (needed to resolve ambiguity in grammar)
-
- ARRAY_CONTEXT like QUANTITY_CONTEXT, but with different rules
- to resolve whether a minus sign represents
- unary or binary minus
- (needed to resolve ambiguity in grammar)
-
- The return value is the result of the expression. This is an assembly
- operand, the type of which indicates what has happened:
-
- LONG_CONSTANT_OT, SHORT_CONSTANT_OT
- the result is this number
-
- VARIABLE_OT the result is in this global/local variable
-
- OMITTED_OT there is no resulting value (for instance,
- this happens if the expression was successfully
- parsed in void context)
-
- EXPRESSION_OT the result can only be determined by running
- the code generated from this parse tree
-
- If an error has occurred in the expression, from which recovery was not
- possible, then the return is (short constant) 0. This should minimise the
- chance of a cascade of further error messages.
-
- The type EXPRESSION_OT is not a valid Z-machine assembly operand type, and
- it's only used here and in the code generator "expressc.c". The value of
- such an operand is an index N, indicating that the root node of the tree is
- in Nth position in the expression-tree-array ET.
-
- The expression-tree-array ET[0], ET[1], ... is an array of "nodes", each of
- which is a structure as follows:
-
- { /* Data used in tree construction */
-
- int up, down, right;
- int operator_number; /* Only meaningful for non-leaves */
- assembly_operand value; /* Only meaningful for leaves */
-
- /* Attributes synthesised during code generation */
-
- ...
-
- }
-
- The numbers ET[n].up, ET[n].down and ET[n].right are node numbers for the node
- above, the leftmost node below, and the next child to the right of the same
- parent: they hold -1 if there is no node in that position.
-
- A "leaf" is a node for which ET[n].down == -1. Leaves hold operands, and
- other nodes hold operators. (Note that leaf operands must be of operand type
- SHORT_CONSTANT_OT, LONG_CONSTANT_OT or VARIABLE_OT: they must be honest
- Z-machine operands.)
-
- Nothing distinguishes the root node except that its number happens to be held
- in an EXPRESSION_OT operand value somewhere.
-
-
- A brief word on deallocation: clearly these parse trees need to survive
- intact until code generation occurs for them. No elaborate system is provided
- here, since the syntax analyser always generates code for any expressions it
- reads while it's still parsing the same syntax line as the expression was
- parsed on. Each time a new syntax line is reached, "expressp.c" is notified
- and it then wipes the ET[n] array empty again.
-
-
- 7 Code generation from parse trees
- --------------------------------
-
- 7.1 Aim and structure of the code generator
- ---------------------------------------
-
- The section "expressc.c" takes as input a parse tree and generates suitable
- Z-code to calculate the result of the expression which the parse tree
- embodies. As a result, its main output is a stream of function calls to
- "asm.c" (see chapter 5 for details).
-
- The syntax analyser calls only one routine:
-
- assembly_operand code_generate(assembly_operand AO, int context, int label)
-
- where AO is expected to be the output from a previous run through "expressp.c".
-
- The contexts are the same as those for expression parsing, except that only
- VOID_CONTEXT, CONDITION_CONTEXT and QUANTITY_CONTEXT are allowed.
-
- The "label" is only meaningful in CONDITION_CONTEXT:
-
- if label >= 0, branch to this label if the expression is false
- as a condition;
-
- if label == -3, rfalse if the expression is true;
-
- if label == -4, rtrue if the expression is true.
-
- (Note that the presence of rfalse/rtrue here allows better code generation
- for statements in the form
-
- if ( ... ) return;
-
- owing to a feature of the Z-machine's architecture for branch instructions.)
-
- The return AO is only meaningful in QUANTITY_CONTEXT; it holds the result
- of the expression, often the stack pointer variable (meaning, the answer
- is on the top of the stack) but not always: e.g., from
-
- j++, 2
-
- the result would be SHORT_CONSTANT_OT 2. Similarly, from the expression
-
- 2
-
- the code generator returns 2 without actually generating code.
-
-
- 7.2 Annotation for conditions
- -------------------------
-
- The first thing the code generator does is to "annotate" the tree for
- conditions. Annotation is the process of assigning useful values to
- every node on the tree, usually in such a way that either the value at
- one node determines the values for its children, or vice versa. (Such
- values are called "attributes" and this inductive definition of them
- is called "synthesis": see ASU, p.280.)
-
- The explanation of this algorithm is about five times longer than its
- source code!
-
- We wish to assign a pair of values written (on paper) in the form
-
- a | b
-
- to every conditional node (that is, every node which is a conditional
- operator such as "~=" or "ofclass", or is a logical operator such as "&&").
- This is to be read as "branch to label a if true, or label b if false".
- Inform has four special label numbers:
-
- -1 means "carry on rather than branch"
- -2 means "branch to the immediately following instruction"
- -3 means "return false rather than branch"
- -4 means "return true rather than branch"
-
- (-2 is used in the assembly of instructions like get_child_zc, which is
- a conditional branch in Z-machine terms but which Inform wants to ignore
- the branch result from. We can ignore -2 here: it behaves exactly like
- -1.)
-
- One of the two numbers a and b must always be other than -1 (otherwise we
- would have rendered a branch meaningless, which must be a mistake). Ideally
- exactly one of them is -1, so that the node can generate a single branch
- instruction.
-
- There are two points where these a | b attributes start to appear in the
- tree. Firstly, in condition context they'll appear at the root node: for
- example, in CONDITION_CONTEXT with label 40 the top node would be labelled
-
- -1 | 40
-
- ("jump to L40 if false"). Secondly, at condition subexpressions used as
- values, e.g. in the tree:
-
- =
- / \
- v ==
- / \
- t 1
-
- This is trickier: the result of == is not a branch somewhere, but a numerical
- value (which must be 0 or 1). To cope with such conversions, two more
- attributes are provided:
-
- to_expression, label_after
-
- A node flagged as "to_expression" is one where a condition is being converted
- to a numerical value; the "==" node above would have this flag set.
-
- "label_after" is either -1 or the number of a label which the code generator
- is to assemble after code generation for the node is complete.
-
- So here are the rules for generating these attributes for node N. Each node
- is sent a pair of values a | b from its parent (in the case of the root
- node, from the code_generate() routine).
-
- annotate N with
- a | b
-
- if N is &&, || or a conditional operator, and a | b is -1 | -1,
- then this must be a condition used as a value:
- so re-annotate N with
- to_expression = TRUE
- label_after = a new label L
- L | -1
-
- if N is an && operator:
- a label for "go here if false" is going to be needed to provide a
- destination for the branch generated by the lefthand operand. So,
- if b = -1, re-annotate N with:
- label_after = a new label L
- a | L
-
- if N is an || operator:
- a label for "go here if true" is going to be needed to provide a
- destination for the branch generated by the lefthand operand. So,
- if a = -1, re-annotate N with:
- label_after = a new label L
- L | b
-
- Finally, we need to pass a pair of values down to each child of the
- node. In the case of all nodes except && and ||, the values passed
- down are
-
- -1 | -1
-
- (because the operator expects numerical operands).
-
- In the case of && and ||, let x | y be the annotation which was finally
- made for this node. We pass values down as follows:
-
- && ||
- / \ / \
- -1 | y x | y x | -1 x | y
-
- (Here "condition node" means a conditional or logical operator. Recall that
- the ~~ operator was eliminated earlier.)
-
- For example, consider the condition in
-
- if (x == 1 || y >= 2) ...
-
- The syntax analyser decides to generate code which will branch to L23 if the
- condition is false. The parse tree
-
- || || -1 | 23, l_a = 50
- / \ / \
- == >= is annotated to == 50 | -1 >= 50 | 23
- / \ / \ / \ / \
- x 1 y 2 x 1 y 2
-
- where the annotation routine created label 50 as a point to branch to if the
- first condition turned out to be true.
-
- Note that it's actually a little wasteful to annotate this way: the node >=
- is annotated with 50 | 23, but label L50 will appear immediately after the
- code generated for this node anyway, so it might as well be annotated -1 | 23.
- The annotater does indeed make this simplification.
-
- Lemma: One of the values a, b sent down to any node is always -1.
-
- Proof: By induction down through the tree. The values sent down to the
- root node are either -1 | L, -3 | -1 or -4 | -1, so the property
- holds for the root node.
-
- Now suppose such a pair a | b is sent down to node N. Unless N is
- && or ||, it sends -1 | -1 to its children, so the property holds
- for the children of N.
-
- If N is &&, then it has two children, left and right. The left
- child is always sent -1 | y, so it has the property.
-
- The right child is sent x | y, unless y is a new label appearing
- after N, in which case it is sent x | -1. Note, though, that
- N itself was sent a pair in which one number was -1 (by inductive
- hypothesis) and so x | y can only have both numbers other than -1
- if a new label was created by N. Since N is an && operator
- this label must be y; and therefore the child is sent x | -1,
- which means the right child also has the property.
-
- The argument for || is symmetrical to this.
-
- Corollary: In the annotation "a | b" of every node other than a && or ||
- operator, either a is -1 and b is not, or vice versa.
-
- The annotation values a | b are only used to generate code for condition
- operators, so we now have the desired result.
-
-
- 7.3 Main traversal
- --------------
-
- The main algorithm is simple. We make a depth-first (i.e., bottom-up)
- traversal of the parse tree, pruning off sub-expressions as code is
- generated for them. In the tree
-
- *
- / \
- a +
- / \
- b c
-
- we first descend to the +, and convert that into
-
- @add b c -> sp
-
- and prune the tree to
-
- *
- / \
- a sp
-
- so that the next instruction generated is
-
- @mul a sp -> sp
-
- and the tree is now just
-
- sp
-
- and the result is therefore on the stack. A complication, however, is what
- order in which to remove the subexpressions. Faced with the tree
-
- -
- / \
- * +
- / \ / \
- a b c d
-
- do we go left first from the top -, or right first? This decision is taken
- for us by the Z-machine's architecture. When the Z-machine executes the
- instruction
-
- @sub sp sp -> x;
-
- it takes the two operands off the stack in the order left to right, i.e.,
- it pulls the first operand first and then the second. (Most stack machines
- work the other way, to make RPN expressions easy to evaluate. The Z-machine
- was designed for forward Polish notation, though, because Infocom's own
- language for it was a Lisp-variant with syntax like (PLUS 2 3).)
-
- This means that we need to push the second operand first, then the first
- operand. In other words, we have to code generate the + operation first,
- and then the * operation.
-
- Therefore we need to traverse the tree depth-first and right-to-left.
- But there is a complication: the && and || nodes need to have their children
- code-generated left-to-right. (Although "and" is logically commutative,
- so that it shouldn't matter, "&&" is not computationally commutative, because
- the language specification requires that the left condition is never evaluated
- if the right condition was false.)
-
- There is one other left-to-right node: the comma operator. For example,
- if the original value of x is 10, then
-
- ,
- / \
- ++ x
- /
- x
-
- must evaluate to 11, not 10, because "x++, x" must evaluate "x++" first,
- throw the result away and then evaluate "x".
-
-
- 7.4 Void context
- ------------
-
- As the above example demonstrates, in some cases sub-expressions need to
- produce a result and in other cases they must not do so. If the "++" operator
- had left a value on the stack, then it would still be there long after the
- expression had been evaluated.
-
- Therefore, when code generating each node it is essential to know where the
- result of the expression is supposed to go. A flag called void_flag is
- passed down to indicate this: if set, then the operator is being evaluated
- in "void context", meaning that the value should be thrown away. The rules
- are:
-
- if the expression is in VOID_CONTEXT, then the root node is;
- the left operand of a comma operator is in void context;
- if a comma operator is in void context, its right operand also is.
-
- Not every subexpression is permitted to be in void context. For instance,
-
- 3*x, y++;
-
- generates the error
-
- Result of expression is thrown away
-
- because the * calculation was pointless. Numerical values or variables are
- not allowed in void context, and nor are any operators which have no "side
- effect": i.e., whose evaluation causes nothing in the Z-machine to change.
- The operators with side effects are:
-
- = in its various forms (5 of them)
- ++ in its various forms (10 of them)
- -- in its various forms (10 of them)
- function call (including message sending)
-
- It is only now, at code generation time, that the error message above is
- generated. (In C this would only cause a warning, but I didn't want to
- waste effort making the code generator assemble code to explicitly throw
- results away.)
-
-
- 7.5 Results of subexpressions
- -------------------------
-
- By the rules above, then, either an operator is in void context and its
- result should be thrown away, or it isn't and the result should go onto the
- stack. However, this would be wasteful, as an expression like "x = 4 + y"
- (where x is a local variable, say) would then generate
-
- @add 4 y -> sp
- @pull x
-
- and give result "x". In fact we want to generate
-
- @add 4 y -> x
-
- and give the result "x". Therefore, when code-generating an operator with
- a result (such as @add), we look at the operator above (which is guaranteed
- still to exist): if it is <set-variable-equal> then we get its left operand,
- write the result of the operator (such as @add) to that, and replace the
- <set-variable-equal> operator with the variable. For instance,
-
- set-var-equal x
- / \
- x + becomes just
- / \
- 4 y
-
-
- One method used to throw away the results of expressions in void context
- is to write the result to temporary variable 255 rather than the stack.
- This is how unwanted function call return values are disposed of (well,
- unless the Z-machine version number is 5 or more, in which case instructions
- for call-but-throw-away-result are used directly).
-
- A few other optimisations are used in void context: for example,
-
- ++ used as postfix on variable
- |
- var
-
- generates the code
-
- @push var
- @inc var
-
- with result "sp" in a quantity context. In void context, it generates
- simply
-
- @inc var
-
- This is quite an important optimisation since postfix-++ is used very often
- in void context (in "for" statements and just as an assignment).
-
-
- 7.6 Conditional and logical operators
- ---------------------------------
-
- The above account covers code generation for arithmetic and assignment
- operators.
-
- In generating && and ||, nothing needs to be done except possibly to
- assemble a label (if the attribute label_after isn't -1).
-
- Generating for conditions like == is in principle easy but in practice
- difficult. Recall that we have three attributes to go on:
-
- a | b labels to branch to on true | false,
- one which is -1 meaning "carry on"
- to_expression flag indicating "convert to numerical value"
-
- One problem is that, because of the "or" operator, a condition like ==
- may have an arbitrary number of children:
-
- if (x == a or b or c or d or e) ...
-
- for instance, must be assembled into the instructions:
-
- @je x a b c ?L2
- @je x d e ?~L1
- .L2
- ...
- .L1
-
- (and if x is a use of the stack pointer, it must be pulled into some temporary
- variable so that it can be non-destructively read more than once). The code
- to achieve this is baroque but uninteresting.
-
- The second question is how to convert a condition to a numerical value:
- either 0 (false) or 1 (true). At present, this is not very optimised, and
- the code looks like:
-
- <if condition is true jump to L1>
- @push 0
- @jump L2
- .L1
- @push 1
- @jump L2
-
- with the result being "sp".
-
-
- 7.7 System functions
- ----------------
-
- When generating the <function call> operator, Inform looks at the leftmost
- operand (the function to call) to see if it's the name of a system function.
- (Note that if the system function has been Replaced, then it will have been
- filtered out long since and the name will have been treated as the symbol
- name for a function in the usual way.)
-
- These are all implemented as short inline functions (sometimes only one
- instruction long) except for "metaclass", implemented as a veneer routine.
-
- Note that
-
- random
- / | |
- a b c ....
-
- nodes are generated by creating a word array whose contents are a, b, c, ...
- and then assembling code to read a random entry from this array. (An
- error is issued if a, b, c, etc. are found not to be constant values.)
-
-
- 8 Assembly of code, text and data structures
- ------------------------------------------
-
- 8.1 Assembly of initial code
- ------------------------
-
- The front end of "asm.c" consists of three routines to assemble routine
- start and finish, and to place labels; plus many routines for assembling
- individual instructions. These all fill out data in the
- assembly_instruction AI and then call a single assemble-instruction routine.
- Thus, the back end of "asm.c" contains four operations to place data into
- the Z-machine's code area:
-
- Routine start assemble_routine_header()
- Routine end assemble_routine_end()
- Label assemble_label_no()
- Instruction assemble_instruction()
-
- For the format of Z-code instructions, see the Z-Machine Standard Document,
- which goes into the Z-machine's format with exhaustive thoroughness. Note
- that Inform 6 is much more compliant to the standard (so far as I know,
- perfectly compliant) than earlier releases: for example, the Flags 2 bits
- for "UNDO is needed" and so on are set correctly for the first time.
-
- One of the few pieces of information fed back to Inform's higher levels by
- "asm.c" is a flag indicating whether execution can possibly reach the next
- instruction or not. The assembly of certain instructions (returning from
- a routine, terminating the Z-machine, unconditionally jumping) sets the
- flag
-
- execution_never_reaches_here
-
- while the placing of a label removes it again (because by branching to that
- label it may be possible to reach this point after all). This is used to
- issue dead code warnings (and also to remove some unnecessary branches
- when code-generating "if"..."else"...).
-
- Assembling a routine header is straightforward: note that an option exists
- here to assemble some code along with it to print out tracing information.
- For this reason, and also to print out assembly trace at compile time,
- "asm.c" accesses the variable names in the symbols table.
-
- All labels are referred to in Inform's higher layers by number, rather than
- by address; only "asm.c" has any access to what their addresses are. Labels
- are numbered from 0 upwards, but the assembler also recognises
-
- -2 branch only to the next instruction (this converts a branch
- instruction such as @get_child to a non-branch one, since it
- causes the instruction to behave identically whether the branch
- is made or not)
- -3 return false rather than branch
- -4 return true rather than branch
-
- The code -1, used higher up for "no branch intended", should never descend
- this far.
-
- Note that the Z-machine has no formal concept of "routine end". Therefore,
- assemble_routine_end() only generates code to cause a return (and only then
- if the present instruction position can be reached). By long-standing
- (if unfortunate) Inform convention, the value returned in this case will
- be 0 for a routine in an object definition, or 1 for any other routine.
-
- The routine-start/end routines also keep track of the usage of local
- variables and labels, and issue warnings about those which were declared
- but not used. Since the scope of these symbols is restricted to the routine
- in which they are defined, label names are unassigned (i.e. reset to
- UNKNOWN_SFLAG CONSTANT_T's). Local variable names are not held in the
- symbols table and are automatically cleared when the next routine starts.
- Finally, we can now detect the use of a label name not defined in the routine,
- so we issue an error message if this has happened.
-
-
- What happens in this first phase of code assembly is that a form of Z-code
- called "initial code" is stored in a temporary holding area of memory until
- a routine has been completed. It is then treated by the branch optimiser
- (see the next section).
-
- "Initial code" is the same as Z-code, except that:
-
- (a) all branches to labels have long form, and instead of an offset
- the branch value is the label number within this routine
- (0, 1, 2, ...);
- (b) LABEL operands (to "jump" instructions) are, similarly, label
- numbers of the label to branch to;
- (c) parallel to the holding area is a uchar[] buffer of the same size,
- holding a "marker value" for each byte.
-
- (Note that branches to -2, -3 and -4 are assembled as short-form branches
- and not marked as BRANCH_MV.) The marker values are as follows:
-
- BRANCH_MV This byte is the first of two in a branch operand
- LABEL_MV Ditto but in a LABEL operand
- DELETED_MV (Inserted by the optimiser, rather than at initial
- code assembly.) This byte has been deleted and should
- be skipped over when the code is transferred to
- long-term storage.
- a "module value" marker value
- This byte is the first of two in a LONG_CONSTANT_OT
- operand which should be marked with this value in
- module compilation
- ditto + 128 Ditto but a one-byte SHORT_CONSTANT_OT or VARIABLE_OT
- operand
-
-
- 8.2 Branch offset backpatching and optimisation
- -------------------------------------------
-
- There are three phases of this process:
-
- (i) Deciding whether each branch should or should not be optimised
- to the 1-byte short form, using the label positions in the
- initial code;
- (ii) Recalculating the label positions to take account of this
- (which will tend to move the labels closer together as second
- bytes of branch operands are deleted);
- (iii) Transferring the initial code to long-term storage, replacing
- branch and LABEL operand values with address offsets and
- issuing module markers as indicated (if in module mode).
-
- A branch can be "short form" if it is a forward branch of between 0 and 29
- bytes from the PC position of the next instruction.
-
- To record a branch as "short form" during (i), the second byte of the branch
- operand is marked as DELETED_MV.
-
- The long-term storage alluded to is either temporary file 2 or else the
- zcode_area memory block. Note that the routines moved there are still not
- finally correct, but are "raw Z-code": Z-code which has incorrect operands
- because value backpatching has not yet occurred.
-
-
- 8.3 Text translation: ISO and Unicode resolution
- --------------------------------------------
-
- The algorithm encoding ZSCII text to a compressed sequence of 5-bit
- "Z-characters" has been documented many times: see the Z-machine standards
- document.
-
- The section of Inform responsible, "text.c", keeps text for three Z-machine
- areas: the static strings area, the dictionary and the "low strings pool".
- (In addition, as requested by "asm.c", it encodes text into the Z-code area
- as part of the assembly instructions @print and @print_ret.)
-
- Static strings are written to a temporary holding area of memory, manipulated
- until correct and then transferred to long-term storage: either temporary
- file 1 or the static_strings_area memory block.
-
- The low strings pool holds a table of strings in low Z-machine memory used
- for abbreviations. Such strings need to be accessible in an unusual way
- (they are addressed by halved byte addresses, not packed addresses) owing
- to an anomaly in the Z-machine architecture, and therefore must reside in
- the bottom 64K of memory: they cannot be written to the static strings area.
-
-
- Source text arrives at the text translation routines still encoded in an
- ISO extension to the ASCII character set (depending on the current value
- of the -C switch). A character value of $ca might mean Latin E-circumflex,
- E-ogonek, Cyrillic capital hard-sign, Arabic Teh or Greek capital Kappa
- according to this context. It would be illegal in plain ASCII or ISO
- Hebrew, but since the Inform lexer has already filtered out any illegal
- ISO characters, this possibility no longer arises.
-
- Furthermore, characters can be specified by string escapes beginning with
- the @ character. The sequence @^E would specify Latin E-circumflex, and
- the sequence @{39A} would specify Greek capital Kappa (by its Unicode
- value, $039A) regardless of the current -C setting.
-
- The work of sorting out character codes is handled by "chars.c". It may
- be useful to summarise the different meanings which character codes can
- hold within Inform:
-
- (1) "Source". Raw source files can contain any character values
- between $00 and $ff. The lexer filters these characters to reduce
- them to:
-
- (2) "ISO". An unsigned 8-bit character code within which the values
- $20 to $7e always have their usual ASCII meanings, and $7f to $9f
- are always undefined, and some values in the range $a0 to $ff
- may have meanings as specified in ISO 8859-1 to -9 if the current
- Inform -C setting is 1 to 9. If -C is set to 0, "ISO" means the
- same as plain ASCII.
-
- (3) "Unicode". Unicode, or ISO 10646-1, is an unsigned 16-bit
- character code which includes essentially every human script.
- Inform uses it as an intermediate state between the diverse
- ISO possibilities and the final ZSCII result.
-
- (4) "Text". A string of ISO characters to specify a sequence of one
- or more Unicode characters. Usually, one ISO character is
- translated to one Unicode character: the exception is for string
- escapes. For instance, the text "a@{39a}c" specifies a sequence
- of 3 Unicode characters.
-
- (5) "ZSCII". An unsigned 10-bit character code used within the
- Z-machine. See the Standards document for its definition: note
- that between $20 and $7e, it agrees with ASCII.
-
-
- There are several translation routines in "chars.c" to move between these
- forms: for instance, "zscii_to_text" is used when printing out the
- contents of the dictionary into a transcript file. (Transcript files
- use the same ISO setting as the original source code.) However, the
- main sequence of translation is:
-
- Source -------> Text -------> Unicode -------> ZSCII
- lexer ISO-to- Unicode-to-
- filtration Unicode; ZSCII
- string escape
- parsing
-
- Note that the first two stages (lexer filtration and ISO to Unicode)
- depend only on the current -C setting, and are always possible. The
- third stage (Unicode to ZSCII) is more complex because it is programmable
- and may fail, depending on what the text is needed for.
-
- Unicode already specifies over 30000 different characters and
- compromise is inevitable in trying to squash this into the 1024 possible
- ZSCII values. The Z-machine's stock of "extra characters"
- (see Standard Document 1.0) is configured by the story file to
- correspond to a block of up to 97 Unicode characters, and these
- are the only ones usable in Inform strings, dictionary words or
- as quoted values.
-
- Inform sets up this block to a sensible default set given the
- current ISO setting. For example, if Inform reads in source code under
- -C6 (ISO Arabic) then it will choose all the Arabic letters from
- ISO Arabic. (This default setting can be overridden using the
- "Zcharacter" directive.) Note that if the designer wants, say,
- Unicode $274b ("heavy eight teardrop-spoked propeller asterisk")
- as well as plain old $0621 (Arabic letter Hamza), this can simply
- be added to the stock accumulated so far.
-
- The stock of extra characters is defined by a list of Unicode values
- written out as the "Unicode translation table" in "tables.c".
-
-
- 8.4 Dictionary management
- ---------------------
-
- As dictionary words are found in the source code, for instance in the value
- of "name" properties, or in grammar, or in constants like 'this', they are
- added to the dictionary by a routine called dictionary_add(), which numbers
- each different word in "accession order": word 0 is the first which was
- entered, and so on.
-
- References to dictionary words are actually stored in the Z-machine as
- byte addresses within the dictionary table, which cannot be known until after
- the compilation pass (when the full alphabetical ordering is known). Such
- references are therefore always marked (with DICT_MV) for backpatching.
-
- During compilation a dictionary table similar to the Z-machine format is
- kept: there is a 7-byte header (left blank here to be filled in at the
- construct_storyfile() stage in "tables.c") and then a sequence of records,
- one per word, in the form
-
- <Z-coded text> <dict_par1> <dict_par2> <dict_par3>
- 4 or 6 bytes byte byte byte
-
- The difference between this and the final Z-machine dictionary table is
- that the records occur in accession order, not alphabetical order. (The
- construct_storyfile() routine in "tables.c" rearranges them.)
-
- The Z-machine does not specify what use is made of the three bytes of
- "dictionary parameters" (it doesn't even specify that there have to be only
- three): for this, see the next section.
-
-
- The type "dict_word" is used for a structure containing the (4 or) 6 bytes of
- Z-coded text of a word. Usefully, because the PAD character 5 is < all
- alphabetic characters, alphabetic order corresponds to numeric order of
- Z-encoding (regarding the Z-encoded bytes as a 32 or 48 bit big-end-first
- number; sign is unimportant as the initial bit is never set). For this
- reason, the dict_word is called the "sort code" of the original text word.
-
- A special text-translation routine is used to "prepare" such a sort code:
- as dictionary words do not use every feature possible in ordinary text
- translation (abbreviations, for instance, or two of the string escape
- forms) it's worth having a separate routine, optimised for speed. Note
- that this routine makes use of "text_to_unicode" and "unicode_to_zscii"
- in "chars.c" when, but only when, it hits a string escape character @.
- (For ordinary ISO characters, when no string escape is hit, these routines
- would be too slow.)
-
-
- Since dictionaries typically contain 500 to 2000 words, speed is extremely
- important when searching and sorting: a simple O(n^2) algorithm to search the
- dictionary, for example, greatly increases Inform's total compilation time on
- medium-size games. (Here "n" is the number of dictionary words.)
-
- Inform 1 to 3 used a crude O(n^2) mass comparison algorithm, which was
- unacceptably slow. Inform 4 to 6.05 used a form of hash table, with the
- first character of a word as hash function (there were 27 hash values, for
- A to Z plus "other"): this behaved very acceptably most of the time, but was
- ultimately still O(n^2) and the uneven distribution of initial letters in
- English meant that the hashing function was not ideal.
-
- Inform 6.1 instead stores the dictionary as a "red-black tree". (See, e.g.,
- Robert Sedgewick, "Algorithms" (1983, second ed. 1988).) The tree is binary
- and always has the property that at node X, anything hanging from the left
- branch of X is alphabetically before what's at X, and anything hanging from
- the right is alphabetically after. For instance, a legal configuration is:
-
- fieldmouse
- / \
- abacus patrician
- / \
- a constantinople
-
- This is simple enough to build and to search through. The problem is that
- one usually ends up with a haphazardly arranged or "unbalanced" tree in which
- some branches are very much longer than others, so that searching times can be
- unpredictable, and in worst case the construction process is still O(n^2).
-
- Thus one wants a method of building the tree which attempts to restore the
- balance as it does so. The algorithm here involves "colouring" each branch
- either red or black (that is, storing one bit of information for each branch).
- Roughly speaking, suppose X is added as a branch of Y. Then the branch
- leading down from Y to X -- the new branch -- is coloured black. But the
- branch leading down to Y -- which was already there -- is recoloured red.
-
- We do not permit the tree to contain two consecutive red branches: whenever
- we notice this happening, we perform a "rotation". A typical rotation is:
-
- a a
- \ \
- bag cat
- / \(red) ---> (red)/ \(red)
- baa cat bag dog
- / \(red) / \
- bun dog baa bun
-
- It's like lifting the "bag" subtree and re-hanging it from the word "cat".
- (There is another case, when the red branches slope in opposite directions.)
-
- We also try to percolate redness upwards (redness being the possibility for
- change in the tree structure). So, when searching, if we ever find
-
- \(black)
- node
- (red)/ \(red)
-
- then we replace it with
-
- \(red)
- node
- (black)/ \(black)
-
- and check the "rotate if you get two consecutive red branches" rule again.
-
- It is known that:
- (i) if there are N words in the tree, then any search takes less than
- 2 log-to-base-2(N) + 2 comparisons (i.e., this is the maximum depth
- of the tree);
- (ii) at worst you need to perform only one rotation for every four
- comparisons.
- In practice, the observed values are even better: the average seems to
- be log-to-base-2(N) comparisons and 1 rotation (though this is unproven).
- Trees are almost optimally balanced most of the time.
-
- For an Inform game with a dictionary of 2000 words, then, we expect to perform
- about 12 comparisons per search. This compares with a typical figure of 75
- using the Inform 6.05 hashing algorithm. Comparisons are not terribly
- expensive, but searches are quite frequent.
-
-
-
- 8.5 Format of tables unspecified by the Z-machine
- ---------------------------------------------
-
- The Z-machine standards document requires many tables to have particular
- formats and locations: but Inform makes several other tables, and this
- section is to document their formats. (See the construct_storyfile()
- routine in "tables.c" for exactly how the byte-addressable Z-machine memory
- is put together.)
-
- Details of differences in modules as compared with story files are given
- later.
-
- (i) Header
-
- Following standard conventions, Inform: leaves bits 2 and 3 of "Flags 1"
- clear; places the release number in the word at bytes 2 and 3; the grammar
- table address at bytes 14 and 15; and the serial number in bytes 18 to 23.
- The default release number is 1 and the default serial number is either
- 970000 (if no access to the current date is available) or the compilation
- date in the form YYMMDD.
-
- In addition, Inform writes its version number in ASCII form into bytes
- 60 to 63 (the last four bytes of the header) in the form "6.01". This
- makes story files produced by Inform 6 distinguishable from all other
- story files, something interpreter writers have long wanted.
-
- (ii) Low strings pool and abbreviations table
-
- By default, the 96 entries are all " " (this Z-encodes to the $80 $00,
- which makes it a convenient default value). The first two banks of 32
- are used for strings declared by Abbreviate; the third is set at run time
- by the "string" statement, to values which have previously been declared
- by Low_string.
-
- The pool of low strings is always placed at $0040, immediately after the
- header and before the abbreviations table.
-
- (iii) Header extension table
-
- The Z-machine standard allows games not to provide a header extension
- table (formally called the "mouse data table" after the only information
- which Infocom ever stored in it) and Inform 6.01 to 6.11 never do.
- Inform 6.12 and later always generate this extension, which always
- contains at least 3 word entries.
-
- (iv) Character set table
-
- This is a table of 78 = 3*26 bytes, giving the Z-character numbers of
- the characters to appear in the three translation alphabets. The table
- is only written if the "Zcharacter" directive has been used to alter
- the standard settings. (See section 12.2.)
-
- (v) Unicode translation table
-
- This is generated (by Inform 6.12 and later) only when needed, i.e., only
- when a game has asked to use a stock of "extended characters" which
- differs from the usual ISO Latin1 set. Games compiled under the ISO
- setting -C0, -C1 (the default setting) and -C9 will usually never
- have a Unicode translation table, though they can have.
-
- (vi) Property default values table
-
- Note that Inform defines properties 1, 2 and 3 automatically: 1 is "name",
- while 2 and 3 are used to support the veneer routines' implementation of
- object orientation.
-
- (vii) Class-number-to-object-number table
- (viii) Individual property values table
-
- See the notes on object orientation; these tables are used by the veneer
- and are unspecified by the Z-machine.
-
- (ix) Grammar table
- (x) Actions table
- (xi) "Preactions" table
- (xii) "Adjectives" table
-
- See the next section. The Z-machine specifies none of these tables.
-
- (xiii) Dictionary table
-
- The format of this table is largely specified by the Z-machine. Although
- the Z-machine allows some flexibility in setting up the dictionary, Inform
- sticks to the usual Infocom configuration:
-
- the separating characters are full stop, comma and space;
- the word-entry length is 7 (in version 3) or 9 (otherwise).
-
- Thus the dictionary begins with seven header bytes:
-
- 3 '.' ',' ' ' 7-or-9 <word containing the number of entries>
-
- Each word is encoded with an entry giving 4 (in version 3) or 6 (otherwise)
- bytes of textual representation, fully specified by the Z-machine, and
- then three bytes of data, which Inform can do with as it pleases.
-
- These are the so-called "dictionary parameters" dict_par1, 2 and 3. Inform's
- pleasure is to write into them as follows:
-
- dict_par1: flags
- dict_par2: verb number (counting downwards from 255)
- dict_par3: preposition number (counting downwards from 255) in
- grammar version 1; not used in grammar version 2
-
- The flags are given as follows:
-
- bit: 7 6 5 4 3 2 1 0
- <noun> <adj> <plural> <meta> <verb>
-
- The bits <verb>, <noun> and <adj> are set if the word can be used in the
- context of a verb, noun and/or preposition (all three can be simultaneously
- set). The <meta> bit indicates that the English verb is "meta", that is,
- is a command to the program and not a request in the game.
-
- The <plural> bit is set using the '...//p' notation, like so:
-
- 'egg' 'eggs//p'
-
- Library 6/3's parser uses this to automatically detect commands referring
- to multiple game objects.
-
- (xiv) Module map
-
- This is an extension to the header, only used in module files (see later).
-
- (xv) Linking data table
-
- A table placed after the end of static strings, but only in module files
- (see later).
-
- To summarise, then, an Inform game has the following memory map:
-
- -----------------------------------------------------------
- Dynamic Header
- memory Low strings pool
- Abbreviations table
- Header extension
- Character set table (if differs from normal)
- Unicode table (if differs from normal)
- Common property default values (*)
- Object tree
- Common property value tables (*)
- Class number to object number conversion table
- Property names table (+)
- Individual property value tables (*)
- Global variable values, part of... (*)
- ...Dynamic array space (*)
- -----------------------------------------------------------
- Static Table of grammar table addresses
- byte The grammar tables (one for each Inform verb) (-)
- addressed Table of action routine addresses (+)
- memory Table of parsing routine addresses (+)
- Table of "adjectives", i.e., prepositions (+)
- Dictionary
- (Module map)
- Synchronising gap of up to 7 null bytes
- -----------------------------------------------------------
- Static Z-code area (*)
- "virtual" Static strings area
- memory (Link data table)
- -----------------------------------------------------------
- (*) This area requires full value backpatching
- (+) This area requires a simple automatic correction
- (e.g. adding on the static strings or code offsets),
- which is done without the full backpatching machinery
- (-) In grammar version 1, this area requires no backpatching,
- but in GV2 each grammar token's data is checked to see
- if a dictionary word or routine address, and backpatched
- if so: so in GV2 it comes under category (+)
-
-
- 8.6 Grammar version numbers GV1 and GV2
- -----------------------------------
-
- The "grammar tables", used by the parser in an adventure game, are not
- specified by the Z-machine at all (contrary to popular opinion) and must
- therefore be fully specified here. There are four such tables:
-
- Grammar table
- Actions table
- "Preactions" table
- "Adjectives" table
-
- Inform can generate two different formats for these tables, known as
- grammar version 1 (henceforth GV1) and grammar version 2 (GV2).
-
- Inform makes the GV1 or GV2 decision according to the value of the symbol
-
- Grammar__Version
-
- which Inform usually defines as 1, but allows programs to redefine. Library
- 6/3 and later, in particular, contain the line
-
- Constant Grammar__Version = 2;
-
- Note that it is essential for the library's parser to understand the
- grammar tables being generated by Inform, so attempting to use GV2 with
- library 6/2 or earlier would fail.
-
-
- The designs of GV1 and GV2 have some points in common. The Grammar table
- is a --> array containing addresses of grammars, one for each Inform verb.
- The Actions table is a --> array containing addresses of action routines
- (such as "TakeSub"), one for each action. (Fake actions have no action
- routines and aren't in the table.)
-
- (The term "Inform verb" means essentially a common parsing grammar, such as
- that shared by "take" and "hold", and is used to distinguish this from an
- "English verb", such as the word "take".)
-
-
- GV1 is at heart an imitation of middle-period Infocom table formats, and
- was used in order that common debugging tools would still work on Inform
- games (an important consideration in the early days of debugging Inform 1).
- In GV1, an individual grammar table has the format:
-
- <number of grammar lines> 1 byte
-
- followed by that many 8-byte lines in the form:
-
- <parameter count> <token1> ... <token6> <action number>
- -- byte --------- --byte-- --byte-- -- byte -------
-
- The parameter count is the number of tokens which are not prepositions.
- This is needed because the run of tokens is terminated by null bytes (00),
- which is ambiguous: "noun" tokens are also encoded as 00.
-
- The numerical token values are as follows:
-
- "noun" 0
- "held" 1
- "multi" 2
- "multiheld" 3
- "multiexcept" 4
- "multiinside" 5
- "creature" 6
- "special" 7
- "number" 8
- noun=Routine 16 + parsing-routine-number
- Routine 48 + parsing-routine-number
- scope=Routine 80 + parsing-routine-number
- attribute 128 + attribute number
- 'preposition' adjective number
-
- illegal values: 9-15, 112-127
-
- This one-byte value has to identify particular prepositions and routines,
- which is only possible using a numbering system for each. GV1 numbers
- parsing-routines upwards from 0 to 31, in order of first use. A separate
- table translates these into routine packed addresses: the "preactions"
- table. (As usual, the term is traditional: Inform has no concept of
- preaction, but the Infocom games from which it inherits GV1 do have such a
- concept.) The preactions table is a simple --> array.
-
- (Note that in Infocom's games the preactions table always has the same
- length as the actions table: this is not true in either GV1 or GV2 Inform
- games.)
-
- Prepositions are also identified by their "adjective number". (An early
- misunderstanding in Z-machine decipherment led to the traditional use of
- the word "adjective" for dictionary words explicitly written into grammar
- lines, which are mainly prepositions like 'into' or 'against'.) Adjective
- numbers count downwards from 255 in order of first use. They are translated
- back into dictionary words using the "adjectives table".
-
- The adjectives table contains 4-byte entries:
-
- <dictionary address of word> 00 <adjective number>
- ----2 bytes----------------- ----2 bytes-----------
-
- To make life more interesting, these entries are stored in reverse order
- (i.e., lowest adjective number first). The address of this table is
- rather difficult to deduce from the file header information, so the constant
- #adjectives_table is set up by Inform to refer to it.
-
-
- GV2 is a much more suitable data structure, easier to read and write,
- less limiting and marginally faster and more economical of Z-machine and
- Inform memory. In GV2 an individual grammar table has the format:
-
- <number of grammar lines> 1 byte
-
- followed by that many grammar lines. Individual lines are no longer always
- 8 bytes long, as in GV1. Instead they have the form:
-
- <action number> <token 1> ... <token N> <ENDIT>
- ----2 bytes---- -3 bytes- -3 bytes- -byte--
-
- The action number is actually contained in the bottom 10 bits of the word
- given first: the top five are reserved for future use, which leaves
-
- action_number & $400
-
- as a bit meaning "reverse the order of the first and second parameters
- if this action is to be chosen".
-
- The ENDIT marker is the number 15. There can be anything from 0 to 30
- tokens, and each occupies three bytes, arranged as:
-
- <token type> <token data>
- -- byte ---- --2 bytes---
-
- Token type bytes are divided into the top two bits, the next two and the
- bottom four.
-
- The "next two bits" are used to indicate alternatives. In a sequence of
- tokens
-
- T1 / T2 / T3 / ... / Tn
-
- then T1 will have $$10 in its "next two bits", and each of T2 to Tn will
- have $$01. Tokens not inside lists of alternatives always have $00. (Note
- that at present only prepositions are allowed as alternatives, but the
- format is designed to open the possibility of extending this to all tokens.)
-
- The bottom four are the "type" of the token. The top two indicate what kind
- of data is contained in the token data. Strictly speaking this could be
- deduced from the bottom six bits, but it's convenient for making backpatching
- GV2 tables a simple matter within the compiler.
-
- Type Means Data contains Top bits
- 0 illegal (never compiled)
- 1 elementary token 0 "noun" 00
- 1 "held"
- 2 "multi"
- 3 "multiheld"
- 4 "multiexcept"
- 5 "multiinside"
- 6 "creature"
- 7 "special"
- 8 "number"
- 9 "topic"
- 2 'preposition' dictionary address 01
- 3 noun = Routine routine packed address 10
- 4 attribute attribute number 00
- 5 scope = Routine routine packed address 10
- 6 Routine routine packed address 10
-
- GV2 makes no use of "adjective numbers" (so that dict_par3 is always zero
- in GV2's dictionary words) and leaves both the adjectives table and the
- preactions table empty.
-
- There is one further difference between GV1 and GV2: in GV1, fake actions
- are numbered from 256 upwards; in GV2, from 4096 upwards.
-
-
- Note that although GV2 takes three bytes for a token as compared with GV1's
- one byte, omission of the redundant null tokens and adjective table means
- that when compiling a small "shell" library game, GV2 actually produces
- more economical tables: 1920 bytes as opposed to 2337.
-
- The first two entries in the following table are the real reason for GV2:
-
- Limit in GV1 Limit in GV2
-
- Prepositions per game 76 unlimited
- Parsing routines (general ones,
- noun= filters, scope= routines
- all put together) per game 32 unlimited
- Tokens per grammar line 6 unlimited
- Actions per game 256 4096
- Inform verbs per game 256 256
-
- In practice the Inform compiler restrains the number of verbs (but that's
- an adjustable memory setting) and lines per verb: in Inform 6.05 and earlier,
- the maximum number of lines per verb is 20. Inform 6.10 internally stores
- grammar roughly as GV2 even when it's going to compile GV1 at the end, and
- this allows a more economical use of Inform's memory: as a bonus, then, the
- maximum number of lines per verb is raised to 32 in Inform 6.10.
-
-
- 8.7 Value backpatching
- ------------------
-
- Value backpatching is the final translation phase in Inform. It is the
- process of correcting temporary values which were written into the Z-machine
- at a time when final values could not be known.
-
- In addition to the Z-machine regions marked in the memory map above, the
- symbols table also undergoes value backpatching: defined Constant symbols
- flagged as CHANGE_SFLAG are backpatched as needed, before first use of their
- values in backpatching something else.
-
- The positions of such values have been "marked" with a "marker value"
- indicating the type of value needed. The backpatchable marker values are:
-
- Marker value Significance of temporary value
-
- STRING_MV Scaled address within static strings area
- ARRAY_MV Byte address within dynamic array area
- IROUTINE_MV Scaled address within Z-code area
- VROUTINE_MV Veneer routine number (a *_VR value)
- NO_OBJS_MV None
- INCON_MV "Inform constant": the index in the #constants
- keyword group (a *_SC value)
- DWORD_MV Accession number of dictionary word
- INHERIT_MV Byte address within common property values table
- of the value which is being inherited here
- INHERIT_INDIV_MV Ditto, but for individual property values table
- INDIVPT_MV Offset in individual property values table
- MAIN_MV None
- SYMBOL_MV Index within symbols table
-
- and these are backpatched as follows:
-
- Marker value Final value
-
- STRING_MV Packed address of static string
- ARRAY_MV Byte address
- IROUTINE_MV Packed address of routine
- VROUTINE_MV Packed address of veneer routine
- NO_OBJS_MV Number of objects in object tree
- INCON_MV Value of constant (typically an address of some
- Z-machine table)
- DWORD_MV Byte address of word's entry in dictionary table
- INHERIT_MV The value to inherit (after it has been backpatched)
- INHERIT_INDIV_MV The value to inherit (after it has been backpatched)
- INDIVPT_MV The byte address of this point in the individual
- property valyes address
- MAIN_MV Packed address of "Main" routine
- SYMBOL_MV Value of symbol (after it has been backpatched)
-
- The code NULL_MV is sometimes used to indicate "no marker", i.e., that a
- value is correct as written. Additional marker values also exist for
- operands held within modules:
-
- IDENT_MV Property identifier number
- ACTION_MV Action number
- OBJECT_MV Object number
- VARIABLE_MV Variable number
-
- (see chapter 10). Note that modules are not value-backpatched at compilation
- time, but at Link time.
-
- Two different memory blocks are used to store backpatch markers: one for the
- "Z-machine image", the byte-addressable memory at the bottom of the memory
- map; and another for the Z-code area. In the interests of economy, these use
- different formats to hold marker data: see "bpatch.c".
-
-
- 8.8 Packed address decoding
- -----------------------
-
- Recall that the formula relating a packed address P to its byte address B is:
-
- B = 2P version 3
- 4P versions 4 and 5
- 4P + 8R versions 6 and 7: routines
- 4P + 8S versions 6 and 7: static strings
- 8P version 8
-
- In versions 6 and 7, R and S can be chosen fairly freely in the range 0 to
- M/8, where M is the minimum address of a routine or static string as
- appropriate. The point is to expand the address space by making higher B
- values available (whereas P has to lie within 0 and 65535): so one wants to
- make R, S larger rather than smaller. On the other hand, it's necessary to
- the Inform language that the following are all different:
-
- (a) the number 0;
- (b) valid object numbers;
- (c) packed addresses of routines;
- (d) packed addresses of strings.
-
- To be sure that (c) and (d) differ, Inform sets S = R. To keep (c) from
- (a) and (b), we must ensure that the packed address of every routine is at
- least X, where X is slightly more than the largest object number. Inform
- also knows the byte address M of the lowest routine in memory ("Main__").
- Note that
-
- M = 4X + 8R
-
- Inform therefore nudges up M and X to ensure that M and 4X are both divisible
- by 8, and sets
-
- R = (M - 4X)/8
- S = R.
-
- To compare this with the code: Write_Code_At is M; extend_offset is X;
- scale_factor and length_scale_factor hold the magic constants 4 and 8.
- code_offset and strings_offset are the scaled addresses of the first
- routine and the first static string respectively, worked out by
-
- code_offset = 4X
- strings_offset = 4X + size of code area
-
- The reason for having these is that when compilation was actually happening,
- Inform did not know enough to calculate packed addresses, and instead
- wrote scaled addresses starting from 0 for each area. In backpatching,
- then, Inform adds the above offsets to each packed address, and then they
- come right.
-
-
- 9 The run-time veneer
- -------------------
-
- 9.1 Services provided by the veneer
- -------------------------------
-
- The "veneer" is a thin section of code generated by Inform as an intermediate
- layer between the code compiled from the source, and the Z-machine itself:
- like the veneer on a table, it gilds the surface of the Z-machine, or so
- the term is supposed to mean.
-
- It consists of 27 routines, 4 class-objects, 2 properties and 2 global
- variables, together with several arrays in the Z-machine's dynamic array
- area.
-
-
- The "veneer.c" section of Inform organises the compilation of the routines;
- each one is compiled only if actually needed, and if no over-riding definition
- has already been given in the source code. (For instance, "PrintShortName"
- is a veneer routine but the Inform library provides a much fuller version
- than the one in "veneer.c".)
-
- Note that the full Inform source code for these routines is present in
- static strings in "veneer.c" (which are compiled using the lexer's "read
- from a string" mode).
-
- The routines come in four groups. Note that names with double underscores
- in are so chosen not to clash with identifiers used by the programmer.
-
- Main__ This is not really a routine at all, but is the piece of code
- which the Z-machine's PC is initially set to. It simply
- makes a function call to Main(), and then quits.
-
- Box__Routine Code to display an array of static strings in a black
- "quotations" box, as required by the "box" statement.
-
- Printing routines PrintShortName, DefArt, CDefArt, etc.
- All normally over-ridden within the Inform library.
-
- Object orientation routines
- Routines to implement message sending, access to individual
- properties, "metaclass", "ofclass" and "provides".
-
- Only the fourth group is interesting and the rest of this chapter is about
- the veneer's implementation of the "object orientation" features new in
- Inform 6 and which for the first time provide a major data structure not
- present in the Z-machine architecture.
-
- Since veneer routines are compiled at the end of the pass (when it's known
- which will be needed), the code area looks like this:
-
- start of code area --> 00 (routine header for Main__)
- initial PC value --> @call_1n Main
- @quit
-
- ... routines as defined in source code...
-
- veneer routines as required, in order
- of appearance in the "veneer.c" table
-
- (Note that Main__ is assembled as a routine. In the Version 6 Z-machine, this
- is called to begin execution. In other versions, the initial PC value is
- set to the call_1n instruction -- which is why Main__ must be first: the
- initial PC value has to lie in the bottom 64K of Z-machine memory.
- Note also: for the Versions 3 and 4 Z-machine more primitive call opcodes are
- used to get into Main.)
-
-
- The four objects in the veneer are the class-objects for the four metaclasses
- defined in all Inform programs: Class, Object, Routine and String. The
- object tree therefore looks like this:
-
- 1 Class
- 2 Object
- 3 Routine
- 4 String
- ... objects and class-objects as defined in source code ...
-
- The two global variables are "self" and "sender", used in message-passing,
- and which are two of the 7 highest-numbered global variables which Inform
- reserves to its own use as "registers".
-
-
- 9.2 Properties 2 and 3, "ofclass" and "metaclass"
- ---------------------------------------------
-
- Inform reserves common properties 2 and 3 to itself, and uses them as follows:
-
- property 2 (additive): a word array containing the class-object numbers
- of all classes of which this object is a member; the property is
- absent if the object belongs to no classes. (Note that metaclasses
- do not appear here: all non-class-objects are members of Object,
- but Object is not listed here.)
-
- property 3: a byte address pointing to the individual properties table
- for this object; property 3 is absent if it has no individual
- properties.
-
- The veneer tests "X ofclass Y" according to the rules:
-
- if X is a valid Z-machine object number:
- if X is a class-object (tested by: if X is a child of Class)
- then only if Y is Class
- otherwise only if Y appears in the property 2 list for X
-
- if X is the packed address of a routine: only if Y is Routine
-
- if X is the packed address of a string: only if Y is String
-
- giving a run-time error if Y isn't a class-object. Note that determining
- which of these ranges contains X needs code essentially the same as that
- in the Inform 5/12 library routine "ZRegion": indeed, this is how the
- veneer routine implementing the "metaclass" function works.
-
-
- 9.3 Class numbering and class objects
- ---------------------------------
-
- In Inform 6 each class definition results in the creation of a related
- object, the class-object for that class.
-
- Class objects have:
-
- their internal symbol-name as short name;
- no attributes (*);
- no (common) properties except possibly for property 3.
-
- All except for Class are placed in the object tree as children of Class.
-
- Immediately following the property values table for a class (which is bound
- to be short, since it can only contain the short name and perhaps property 3)
- is a six-byte block of attributes which would be inherited from the class (*),
- and then a second property values table, specifying the properties which
- members of the class would inherit from it. (These values are accessed at
- run time to implement the superclass access operator :: when applied to a
- common property.)
-
- Property 3 is used to hold the byte address of the individual properties
- table of properties which members would inherit. It is absent if there
- are no such properties. (These are accessed at run-time to implement the
- superclass access operator :: when applied to an individual property.)
-
- Note that there are two numbering systems used for classes: the "class
- number" counts 0, 1, 2, 3, ... in order of declaration, with Class numbered
- 0. (This is used in superclass message number coding.) Classes are also
- referred to by the object numbers of their class objects. It's necessary
- to be able to convert between the two at run time; the "class numbers table"
- is a word array whose Nth entry is the class-object number corresponding to
- class number N.
-
- [(*) This is a change made in Inform 6.12. Previously, class objects
- had the attributes which would be inherited from the class. 6.12 moves
- this information to a harmless block of six bytes elsewhere, essentially
- so that loops like "objectloop (x has animate)" would not pick up classes
- as well as objects.]
-
-
- 9.4 Individual property identifiers
- -------------------------------
-
- Just as the common properties are identified by property numbers 1 to 63,
- so the individual properties are identified by numbers 64 upward. The
- group 64 to 71 is used to identify the properties provided by inheritance
- from the metaclasses:
-
- Id Name Provided by
-
- 64 create class-objects other than metaclass-objects
- 65 recreate ditto
- 66 destroy ditto
- 67 remaining ditto
- 68 copy ditto
- 69 call objects of metaclass Routine
- 70 print objects of metaclass String
- 71 print_to_array ditto
-
- However, in addition to this a property ID number with its top bit set has
- a special meaning:
-
- Bit 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
- set -------------------------- -----------------------------
- entry number class number
- within class's i.p. table
- (counting from 0)
-
- which means "the given class's inheritable value of the property with this
- ID number", and is used to implement the "superclass" operator as something
- which acts on ID numbers:
-
- Class :: identifier
-
- is implemented by writing the class number into the bottom bits and searching
- as above. Further, a property ID number in the form:
-
- Bit 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
- 00 set --------------------- -----------------------------
- common prop number class number
-
- refers to "the given class's inheritable value of this common property".
-
- As a special rule, in the case of class Object this accesses the Z-machine
- default property value. Thus, since the library defines
-
- Property cant_go "You can't go that way.";
-
- then X.Object::cant_go will always evaluate to "You can't go that way."
- for any Object X.
-
-
- Note that the need to be able to pack such descriptions into 16 bits is one
- reason classes need to have a 0, 1, 2, ... numbering system in addition
- to the class-object system of numbering (which has many lacunae). Even as
- it is, the present veneer implementation constrains us to a maximum of 256
- class definitions, each able to provide at most 128 individual properties
- (in addition to the potentially 63 declared as common using Property);
- more work may be needed to alleviate this at a later date.
-
-
- 9.5 Individual property tables
- --------------------------
-
- For each object that provides any individual properties at all, property
- 3 holds the byte address of its individual property table. The entries
- in this table are pairs of identifier number and current value: thus,
-
- raw identifier number (2 bytes)
- but with top bit set if "private" to the object in question
- property length (1 byte)
- current value (as many bytes as specified in the property length,
- though this will always be an even number >= 2)
-
- terminated by a zero word, 00 00. (Note that there is no property 0,
- common or individual.)
-
- The properties can occur in any order and need not be in numerical order
- of their IDs (which is helpful to assist linking, though it may have speed
- implications). The following Inform code prints out an object's table:
-
- [ IndivTable obj x n;
- x = obj.3;
- if (x == 0) rfalse;
- print "Table for ", (object) obj, " is at ", (hex) x, "^";
- while (x-->0 ~= 0)
- { print (hex) x-->0, " (", (property) x-->0, ") length ", x->2, ": ";
- for (n = 0: n< (x->2)/2: n++)
- print (hex) (x+3)-->n, " ";
- new_line;
- x = x + x->2 + 3;
- }
- ];
-
- [ hex x y;
- y = (x & $ff00) / $100;
- x = x & $ff;
- print (hdigit) y/$10, (hdigit) y, (hdigit) x/$10, (hdigit) x;
- ];
-
- [ hdigit x;
- x = x % $10;
- if (x<10) print x; else print (char) 'a'+x-10;
- ];
-
-
- From an inheritance point of view, individual properties are always
- non-additive: a specified value overrides and replaces an inherited value.
- (The idea is that access to superclass values is a more elegant alternative
- to using additive properties to make class values remain available to
- inheritors which have altered them.)
-
-
- The veneer contains routines to implement "provides", ".", ".&" and ".#",
- together with routines to apply ++ and -- as pre- or postfix operators,
- and a routine to implement = in the context of individual properties.
- These all lapse back into ordinary Z-machine instructions for common
- properties if they find a property number in the range 1 to 63.
-
- Note that the veneer's implementation of "provides" contains a special
- rule meaning that class-objects do not provide any properties except the
- five inherited from the metaclass Object, even though they have a table
- in the above format (which contains the inheritable values).
-
-
- The veneer's highest-level routine sends a message to an individual
- property. (This same routine contains implementations of the eight messages
- numbered 64 to 71.) Note that the trickiest part is keeping "self" and
- "sender" right:
-
- push the current "self" value
- push the current "sender" value
- set "sender" to "self"
- set "self" to the destination object
- now make the function call
- pull to "sender"
- pull to "self"
-
-
- Note that the object creation/deletion system works in a quite simple way:
- the pool of uncreated duplicate members of a class are exactly the children
- of its class-object. Objects are re-initialised when they are returned to
- this pool, leaving them "fresh" and ready for recreation.
-
-
- 9.6 Availability of symbol names at run-time
- ----------------------------------------
-
- It is not possible to determine at compile-time if a message is wrongly
- addressed, i.e., if a message is sent to property X of object O but object O
- does not provide X. For one thing, availability depends on who is calling
- (if the property is private to O); for another, message identifiers do not
- need to be constants.
-
- Therefore a good error-reporting mechanism is needed at run-time; errors like
- "Object 32 has no property 57" would make debugging hopelessly difficult.
- The only way to achieve this is for the property names to be available at
- run time. Similarly, it's helpful for debugging purposes if attribute
- names and action names can also be available.
-
-
- To this end, Inform compiles a table of identifier names into every
- story file (though not into modules) and provides a system constant called
- #identifiers_table pointing to the start of the table. The table contains:
-
- --> 0 = Number of property names plus 1 = P
- --> 1 = Packed address of string containing name of property 1
- to
- --> (P-1) = Packed address of string containing name of property P-1
- (properties 1 to 63 will be common, and subsequent ones
- individual: this table includes both kinds; note that
- there is no property numbered 0)
-
- --> P = Packed address of string containing name of attribute 0
- to
- --> (P+47) = Packed address of string containing name of attribute 47
-
- --> (P+48) = Packed address of string containing name of action 0
- and so on.
-
- In the case of (common) property and attribute names, the ability of
- programmers to "alias" names together means that these names will sometimes
- be given as a list of possibilities with slashes in between.
-
- Note that printing property names so as to properly handle usage of ::
- is harder than simply looking up in this table (for instance, there are
- times when the right output should look like "Coin::value"). Inform
- provides the construct
-
- print (property) X;
-
- to handle this. But the corresponding cases for attributes and actions
- are easier. Simply define:
-
- [ DebugAction a anames;
- if (a>=256) print "<fake action ", a-256, ">";
- anames = #identifiers_table;
- anames = anames + 2*(anames-->0) + 2*48;
- print (string) anames-->a;
- ];
- [ DebugAttribute a anames;
- if (a<0 || a>=48) print "<invalid attribute ", a, ">";
- else
- { anames = #identifiers_table; anames = anames + 2*(anames-->0);
- print (string) anames-->a;
- }
- ];
-
- (These are in fact both part of Library 6/3 anyway: a different DebugAction
- routine is present in Library 6/1 and 6/2 if DEBUG is set.) You can then:
-
- print (DebugAction) X;
- print (DebugAttribute) X;
-
-
- It is also for error reporting that the names of the classes are needed at
- run-time, which is why the short name of a class-object is the name of the
- class.
-
-
- 10 Compiling and linking modules
- -----------------------------
-
- 10.1 Model
- -----
-
- The model for linking is that a game being compiled (called the "external"
- program) may Link one or more pre-compiled sections of code called "modules".
- Suppose the game Jekyll has a subsection called Hyde. Then these two
- methods of making Jekyll are (almost) equivalent:
-
- (i) Putting Include "Hyde"; in the source code for "Jekyll", and
- compiling "Jekyll".
-
- (ii) Compiling "Hyde" with the -M ("module") switch set, putting
- Link "Hyde"; into the same point in the source code for "Jekyll",
- and compiling "Jekyll".
-
- Option (ii) is of course much faster if "Hyde" does not change very often,
- since its ready-compiled module can be left lying around while "Jekyll"
- is being developed.
-
- Option (ii) also slightly increases story file size, as some code
- optimisations are impossible this way: e.g., linking rather than including
- the library costs about 160 bytes on final story file size.
-
-
- In the linking process, two incarnations of Inform need to communicate with
- each other:
-
- A. one in the middle of compiling a story file
- and wanting to link a module into it;
-
- B. one compiling a module for future linking into
- some unknowable story file.
-
- To be more precise, B needs to send information to A. It does so by writing
- a modified form of story file with additional tables attached, and this is
- what a module is.
-
- We discuss how Inform copes with situation A first, as this indicates the
- information which B needs to send it; we then discuss how B gathers and
- sends this information.
-
- Note that A and B can never apply in the same run-through of Inform, since
- a module cannot link another module into it.
-
- Finally, note that A and B may be different ports of Inform running on
- different machines at different times, since the module format is
- machine-independent.
-
-
- 10.2 Linking a module
- ----------------
-
- There are two tasks: (a) merging the module's data structures into those of
- the external story file, and (b) mending all the constants used in these data
- structures which couldn't be known at module compilation time.
-
- (a) Merging data structures
-
- This is formidably hard, since it means extracting tables in Z-machine format
- and putting them back into Inform's own (usually higher-level) data
- structures: in effect, an inverse is needed for each translation operation.
- The following table briefly lists the data structures which are merged, and
- how this is done:
-
- Structure Method
-
- Initial global variable values Directly copy in
- Initial array entry states Append
- Code area Append. Z-code is relocatable except
- that function calls are to absolute
- addresses: these are handled as though
- they were references to constants
- unknown at module compilation time
- Static strings area Append
- Dictionary Extract words from the module, one at
- a time, and insert into the story
- file's dictionary. (The dictionary
- is just too complex to merge the
- internal structures.)
- Object tree Append, and also fix all the object
- numbers used in parent, sibling and
- child fields: a particular concern
- is with classes defined in the
- module, which have to be added as
- children of the main program's
- Class object
- Property values area Append
- Individual property values Append
- "Flags 2" header bits Bitwise OR
- Class to object numbers table Effectively append, moving the object
- numbers up appropriately. The
- module's class numbers table contains
- offsets of the class inheritance
- property blocks as well, and these
- are appended too
-
- (b) Mending references
-
- The problem here is to absorb numerous references within the module into
- the backpatch table for the story file. It is not possible just to append
- the module's backpatch tables onto the story file's backpatch tables:
- all the offsets need adjusting, and certain kinds of backpatching need to
- take place immediately (to fix the four marker values not allowable in
- story file backpatch tables, ACTION_MV, IDENT_MV, VARIABLE_MV and OBJECT_MV).
-
-
- 10.3 Imports and exports
- -------------------
-
- The main "extra" ingredients in a module (compared with an ordinary story
- file) are: two backpatching tables (see the next section), and an
- import/export table giving the names of symbols to be shared with the main
- story file which will link in with it.
-
- The language of "imports and exports" considers the module to be the home
- country, and the external program to be foreign. (An exported symbol's value
- is assigned in the module and used in the external program; an imported
- symbol's value is assigned in the external program and used in the module.)
-
- An export is a symbol defined in the module which may be referred to
- in the external program. All variables, routines and general constants
- defined in the module except for (a) system-defined symbols always present
- (such as "Class", "true", etc.) and (b) attributes, common properties and
- labels are exported.
-
- An import is a symbol name used in compiling the module which isn't defined
- within it. During module compilation, just as in story file compilation, if
- an unknown name is hit then an operand is constructed which is marked
- SYMBOL_MV and has the symbol index as its value. In the case of a story
- file, at value backpatching time the correct value is found out and written
- in. In the case of a module, there is no value backpatching phase.
-
- Instead, all symbols are flagged with IMPORT_SFLAG if ever used when undefined
- (i.e., if any SYMBOL_MV marker is ever issued). Any which remain undefined
- at the end of module compilation (in fact, at the same point where exports
- are made) are "imported": that is, their names are written down so that their
- existence can be checked on at Link time. It is an error for any imported
- symbol not to exist in the external program performing the Link.
-
- Note, therefore, that backpatch markers in the module with marker value
- SYMBOL_MV may refer either to symbols which were assigned values later on
- during module compilation (and are thus exported) or to symbols which were
- never assigned values (and are thus imported).
-
-
- 10.4 Backpatch backpatching
- ----------------------
-
- Although modules undergo branch backpatching and optimisation just as story
- files do, they do not undergo value backpatching.
-
- Instead, the two value-backpatching tables (zcode_backpatch_table and
- zmachine_backpatch_table) are appended to the module file. The basic idea
- is that when the external file merges the module's data structures into its
- own, it also adds the module's backpatch markers into its own collection.
- Thus, the module's code and data are finally value-backpatched when the
- external program is value backpatched.
-
- Unfortunately, these backpatch markers have addresses which are correct for
- the module, but wrong for the external program: they require correction
- before they can be transferred, and this is called "backpatch backpatching".
-
- In addition to this, a few special types of backpatch markers (only ever
- generated inside modules, never in story files) are dealt with immediately.
-
- A further complication is that the module may have exported a symbol which
- itself has a value needing backpatching. For example, if the module contains
-
- Constant Swine 'hog';
-
- then, in its symbol table, the symbol Swine will have the marker DWORD_MV
- attached to its value. When this is exported to the external program, the
- symbol value will need backpatch-backpatching.
-
-
- The sequence of events is roughly:
-
- 1. Load the module into an allocated block of memory.
- 2. Merge the dictionary, finding out which dictionary
- accession numbers in the external program correspond to
- which in the module.
- 3. Go through the import/export table, until we know
- which symbol numbers in the module correspond to which
- symbol numbers in the external program; and which
- variable numbers to which; and which action numbers
- to which; and which property identifiers to which.
- (Arrays called "maps" are created to encode all these.)
- 4. Backpatch, or deal with the backpatch markers attached to,
- any exported symbols from the module.
- 5. Go through the Z-code backpatch table:
- deal with IDENT_MV, ACTION_MV, VARIABLE_MV and OBJECT_MV markers
- immediately, by backpatching the copy of the module
- held in memory;
- backpatch the other markers and then transfer them
- into the external program's Z-code backpatch table.
- 6. Do the same for the Z-machine backpatch table.
- 7. Now merge the memory copy of the module into the external
- program.
-
- (There are actually 17 stages, but most of the latter ones are mergers of
- different tables which can happen in any order.)
-
- As an example of "backpatch backpatching", the backpatch marker for the
- quantity 'hog' will be DWORD_MV, with value the accession number of the word
- "hog" in the module's dictionary. This accession number needs to be changed
- to the accession number of "hog" in the story file's dictionary.
-
- The four "module only" backpatch marker values are:
-
- VARIABLE_MV global variable number in module's set
- OBJECT_MV object number in the module's tree
- IDENT_MV identifier number in the module's set
- ACTION_MV action number (always between 0 and 255, though
- always in a "long" constant so that a fake
- action number can be backpatched over it)
-
- Note that within the module, there is no way to tell an externally defined
- action from a fake action. That is, the reference ##Duckling might be
- to an external action or fake action called "Duckling". Within the module,
- it is created as an action name in the usual way; at Link time, the action
- map corresponds this module action number to a fake action number or a real
- one as required.
-
- Variable numbers in the module are not marked if they are local (or the
- stack pointer), or if they are from the system set ("self", "sender" and so
- on), whose numbers are guaranteed to be correct already. In a module, the
- available variable numbers are 16 to N-1, where N is the number of the lowest
- system-used variable (at present, 249). Imported variables are numbered
- from 16 upwards, and variables defined newly in the module from N-1
- downwards. If these allocations collide in the middle, then the Z-machine
- has run out of variables.
-
-
- 10.5 How modules differ from story files
- -----------------------------------
-
- Module files are identical to story files, except that they have not
- undergone value backpatching, and do not contain any part of the veneer; and
- except as detailed below.
-
- (i) Size
- ---------
- A module is at most 64K in size.
-
- (ii) Header entries (where different from story file header entries)
- --------------------------------------------------------------------
- Byte Contents
-
- 0 64 + V (where V is the version number)
- 1 module version number
- 6,7 byte address of the module map table
- (note that this is the "initial PC value" slot in the
- story file header format, but no such value is meaningful
- for modules)
-
- V is used to make sure that we aren't trying to link, e.g., a Version 5
- module into a Version 8 story file: this would cause heaps of backpatch
- errors, as the packed addresses would all be wrong. Although we could
- in principle automatically fix them all up, it isn't worth the trouble:
- the user only needs to compile off suitable versions of the modules.
-
- The module version number is a version number for the module format
- being used: this document describes module version 1.
-
- (iii) Class numbers table
- -------------------------
- In a module, the class numbers table contains additional information,
- and has the form:
-
- <word containing object number for Class>
- <word containing class 1 inheritance-properties block offset>
- <word containing object number for class 1>
- <word containing class 1 inheritance-properties block offset>
- <word containing object number for class 2>
- <word containing class 2 inheritance-properties block offset>
- ...
- <word containing object number for class N>
- <word containing class N inheritance-properties block offset>
- 00 00
-
- (In a story file, the inheritance-properties block offsets are absent.)
-
- (iv) Identifier names table
- ---------------------------
- This is missing altogether in a module.
-
- (v) Dictionary
- --------------
- A module dictionary is in accession order, not alphabetical order. (This is
- necessary so that backpatch markers in the module, which refer to dictionary
- words by accession number, can be connected with their original words at
- link time.)
-
- (vi) Module map
- ---------------
- The module map is (currently) 15 words long and contains the following:
-
- Word 0 byte address of object tree
- 1 byte address of common property values table
- 2 scaled address of static strings table
- 3 byte address of class numbers table
- 4 byte address of individual property values table
- 5 number of bytes in indiv prop values table
- 6 number of symbols in module's symbols table
- 7 number of property identifier numbers used in module
- 8 number of objects present in module
- 9 byte address of import/export table
- 10 its size in bytes
- 11 byte address of Z-code backpatch table
- 12 its size in bytes
- 13 byte address of Z-machine image backpatch table
- 14 its size in bytes
-
- The map serves as an extension to the header, which has more or less run out
- of available slots for new pieces of information.
-
- (vii) Parents of class objects
- ------------------------------
- Objects representing classes are given in the object tree as having
- parent $7fff (and having next-object 0, as if each were an only child).
- These are to be understood as being children of the Class object (which
- is not present in a module).
-
- (viii) Import/export table
- --------------------------
- This is a sequence of symbols (mixing up imports and exports in no particular
- order), occupying records as follows:
-
- Record type (1 byte)
- Symbol number in module symbols table (2 bytes)
- Symbol type (1 byte)
- Symbol backpatch marker (1 byte)
- Symbol value (2 bytes)
- Symbol name (null terminated)
-
- where the possible record types are:
-
- IMPORT_MV import a symbol
- (in which case the symbol backpatch marker
- is omitted)
-
- EXPORT_MV export a symbol defined in a non-system file
- EXPORTSF_MV export a symbol defined in a system file
- EXPORTAC_MV export an action name (in the form
- "Name__A")
-
- Note: we need to distinguish between EXPORT_MV and EXPORTSF_MV in order to
- make Replacement of routines work properly. Suppose the user asks to Replace
- the routine DrawStatusLine, normally found in the parser, and then links the
- parser module and also a module of his own containing his new definition of
- DrawStatusLine. The linker knows which of these to accept because the first
- was compiled from a system module, and thus exported with EXPORTSF_MV, while
- the second was not, and was exported with EXPORT_MV. Other than for this
- purpose, the two values are treated equivalently.
-
- (ix) Z-code backpatch table
- ---------------------------
- This is exactly the module's Z-code backpatch table, no different in format
- from that used for backpatching story files (except that four additional
- marker values are permitted).
-
- (x) Z-machine image backpatch table
- -----------------------------------
- This is exactly the module's Z-machine backpatch table, no different in
- format from that used for backpatching story files (except that four
- additional marker values are permitted).
-
-
- 10.6 Restrictions on what modules may contain
- ----------------------------------------
-
- [1] It is impractical to allow the module and the external program use of
- each others' attribute and property names, because there is no rapid way
- of repairing the module's object tables. Instead, both program and
- module must use the same Include file to make the same attribute and
- property definitions as each other.
-
- [2] An object defined in a module cannot:
- (a) have an externally-defined parent object,
- or (b) inherit from an externally-defined class.
-
- [3] A module can only use externally-defined global variables if they have
- been explicitly "imported" using the Import directive.
-
- [4] A module cannot use Verb or Extend: that is, it cannot define grammar.
-
- [5] A module cannot use Stub or Default (for obvious reasons).
-
- [6] A module cannot use "unknown at compile time" constants in every
- context. (Unknown in that, e.g., 'frog' and MAX_FROGS might be
- unknown: the former because the dictionary address of 'frog' in the
- final program can't be known yet, the latter (say) because MAX_FROGS
- is a constant defined only in the external program.) In the
- following list of contexts in which constants occur in Inform source
- code, those marked with a (*) are not permitted to be "unknown at
- compile time":
-
- 1. In high-level source code (including as a switch value and
- a static string in a "box" or "string" statement).
- 2. In assembly language source code.
- 3. As the initial entries in an array.
- 4. As the initial value of a variable.
- 5. In a CONSTANT definition.
- 6. As an object's (common or individual) property value.
- * 7. As a release number. (Defining the module release number only.)
- * 8. As a version number. (Modules are always version 5.)
- * 9. In grammar. (You can't define grammar in a module.)
- * 10. As the size of an array.
- * 11. In a DEFAULT constant definition. (You can't use Default.)
- * 12. In an IFTRUE or IFFALSE condition.
- * 13. As a property default value (though this is unnecessary since
- the definitions in the outside program take precedence).
- * 14. As the number of duplicate objects provided for a class.
-
- Only (10) and (14) are restrictive. Combining a module with a short
- Include file to make such definitions will get around them.
-
-
- 11 Service levels
- --------------
-
- 11.1 Variables and arrays
- --------------------
-
- Each of the sections is responsible for the initialisation of its own
- variables, and for the allocation and deallocation of its own arrays. To
- this end, every section except "inform.c" (which is considered to be in
- charge of the others) provides a set of four routines to manage its
- data structures:
-
- init_*_vars() called when the compiler is beginning a run
- *_begin_pass() called at the beginning of the compilation
- pass through the source code: this usually sets
- some of the variables
- *_allocate_arrays() called when the compiler is beginning a run
- *_free_arrays() called when the compiler is ending its run
-
- Note that *_free_arrays() must exactly undo what *_allocate_arrays() has
- done.
-
- The 19 routines of each kind are polled in no particular order (actually,
- in alphabetical order).
-
- Note that there are also
-
- lexer_begin_prepass()
- files_begin_prepass()
-
- which need to happen even earlier than the general *_begin_pass() poll,
- and also
-
- lexer_endpass()
- linker_endpass()
-
- but since the other sections generally don't need to do anything at end
- of pass, it wasn't worth making a formal poll of it.
-
- Static initialisation of variables is only allowed when they are certain
- never to change (for example, the string containing the accent escape
- characters never changes).
-
-
- 11.2 Memory allocation and deallocation
- ----------------------------------
-
- All allocation and deallocation is routed through routines given in
- "memory.c": specifically,
-
- my_malloc() and my_free()
-
- Note that these are called with a string as an extra parameter (as compared
- with malloc() and free()), which is used to print helpful error messages in
- case of collapse. This is where the -m ("print the memory allocation") switch
- is implemented, and it's also a good place to add any OS-specific code which
- may be needed.
-
- This may be needed because, under ANSI rules, malloc() is entitled to refuse
- to allocate blocks of memory any larger than 32K minus a few bytes. On a few
- implementations malloc() actually does this, even on very large machines,
- because it's convenient to avoid memory segmentation problems. (For instance,
- on a few versions of C for the Macintosh, it's possible to have 16M free
- and still be refused a request for 33K of it.) Consequently it may be
- necessary to use some compiler-specific routine instead.
-
- Inform allocates many arrays (50 or so) and in a normal compilation run,
- the largest of these occupies about 26K. However, the implementation of
- Link requires the whole of a module file to be held in one contiguous block
- of memory, potentially 64K long; and the same applies to the array holding
- the unpaged memory area of the Z-machine during construct_storyfile().
-
- Failure of my_malloc() causes a fatal error.
-
-
- Inform also provides a structure (memory_block) for allocating extensible
- regions of memory (at a slight access-speed penalty): these allocate
- 8K chunks as needed. Three of these are used as alternatives to the
- temporary files when -F0 is operating, and two more are used to hold
- backpatch markers.
-
-
- 11.3 Error messages
- --------------
-
- All diagnostic and error messages are routed through the "errors.c" section
- (except for errors in parsing ICL, as this takes place before the compiler
- itself is active).
-
- There are three kinds of diagnostic: warning, error and fatal error.
- Fatal errors cause Inform to halt with an exit(1), which is about as fatal
- as anything can be to a C program. (Fatal errors are mainly issued when the
- memory or file-handling environment seems to have gone wrong.) Errors
- allow compilation to carry on, but suppress the production of any output
- at the end: in some ways this is a pity, but there are not many errors which
- one can be confident of safely recovering from.
-
- Note that a sequence of MAX_ERRORS (normally 100) errors causes a fatal error.
-
- Error messages which are generated during the compilation pass try to quote
- the source code line apparently responsible (unless the "concise" switch
- is set): but Inform may be unable to do this if the text was read and lost
- too long ago, for example in the case of a "declared but not used" warning.
-
- Note that informational messages (such as the banner line, the statistics
- printout, etc.) are not routed through "errors.c", or indeed anywhere else.
-
- Error recovery in Inform is mainly a matter of going into "panic mode"
- (amusingly, this is a term of art in compiler theory): tearing through the
- tokens until a semicolon is reached, then starting afresh with a new
- statement or directive. This pretty often goes wrong (especially if a
- routine start/end is not noticed), and more work might be needed here.
-
-
- 11.4 File input/output
- -----------------
-
- The "files.c" section handles almost all of the file input/output in Inform:
- the two exceptions are in cli_interpret() in "inform.c", which loads in ICL
- files; and in link_module() in "linker.c", which loads in modules (a simple
- job which there was no point abstracting).
-
- The biggest shake-up to this section since Inform 5 was the realisation that,
- far from being a grubby low-level activity, working out good filenames was
- in fact a task which had to be done much higher up in Inform: so all of
- the code which used to do this is now located in the ICL interpreter in
- "inform.c".
-
- What remains in "files.c" is:
-
- opening source code and reading it into buffers;
- opening and closing the temporary files;
- calculating the checksum and length of, and then outputting, the
- story file or module being compiled;
- opening, writing to and closing the text transcript file;
- opening, writing to and closing the debugging information file.
-
- For each different source file that has ever been open, "files.c" maintains
- a structure called FileId, containing the full filename and a file handle.
- The files are never read from except via the routine file_load_chars(),
- which fills up the lexical analyser's buffers.
-
- The routine output_file() writes the story/module file: see its comments
- for details. It takes up where construct_storyfile() leaves off, and
- contributes the last two pieces of header data to be worked out, the
- checksum and the length fields. Note that the checksum calculation is
- only known after the whole file has been written (mainly because Z-code
- backpatching alters the checksum, but can't be done until output time without
- consuming a good deal of extra memory); so that an "fseek" is made to skip
- the write position back into the header and overwrite the correct checksum
- just before closing the file. This call is made in a strictly ANSI way
- but, like everything on the fringes of the C library-to-operating-system
- interface, may cause trouble on some compilers.
-
- The other routines are largely self-explanatory.
-
- There are three temporary files (although they won't be used if -F0 applies),
- as follows:
-
- Temporary file Contents
-
- 1 The static strings area
- 2 The Z-code area
- 3 The link data area (only opened and used
- in -M, module mode)
-
- For the format of the debugging information file, see the utility "infact.c",
- which prints from it. This format is likely to change in the near future
- in any case, as it is already inadequate for some of the new features in
- Inform 6.
-
-
- 12 Low-level language features
- ---------------------------
-
- 12.1 Using the "Trace" directive
- ---------------------------
-
- The trace directive is primarily intended for compiler maintenance purposes.
- It can cause Inform to print tracing information on nine different aspects
- of what it does, in most cases at several levels of detail.
-
- -------------------------------------------------------------------------
- Tracing option Level Output
- -------------------------------------------------------------------------
- "assembly" 1 all assembly language produced, in
- an imitation of "@..." syntax
- 2 and also the actual bytes produced,
- in hexadecimal
- (Note that because trace printing
- occurs before label optimisation,
- addresses of instructions cannot be
- relied on: nor can the values of
- operands which are marked for later
- backpatching.)
-
- "tokens" 1 lexemes of all tokens output from the
- lexer to the syntax analyser, and
- indication when a token is put
- back
- 2 full token descriptions
- 3 and also the lexical context in which
- they were identified
-
- "expressions" 1 annotated parse trees for all
- expressions being code-generated
- 2 and the behaviour of the shift-reduce
- parser when parsing it, and
- the parse tree received by the
- code generator (not yet annotated)
- 3 and the token-to-etoken translations
- made, and the parse tree produced
- by the emitter before lvalue
- checking
-
- "linker" used in 1 list all import/export information sent
- compiling module 2 and all marker information sent
-
- "linker" used in 1 list all modules linked in
- compiling story file 2 and all import/export information
- received
- 3 and all marker information received
- 4 and how each marker was dealt with
-
- "lines" --- (currently inoperable)
- -------------------------------------------------------------------------
- "dictionary" 1 print current dictionary contents
- (including verb/preposition nos.)
- in alphabetical order
-
- "objects" 1 print current object tree
-
- "verbs" 1 print current grammar
-
- "symbols" 1 print out those entries in the symbols
- table which are not: unknown,
- generated by the veneer or in a
- system file
- 2 print entire symbols table
- -------------------------------------------------------------------------
-
-
- 12.2 System constants and other secret syntax
- ----------------------------------------
-
- The addresses of many important tables in the Z-machine are not recorded
- in the header, or anywhere else: but they are known to the compiler, and
- needed by the run-time code. The system constants are provided mainly
- as a way of passing this information into run-time code.
-
- Constant Evaluates to
- --------------------------------------------------------------------------
- #version_number Version number of Z-machine format being compiled to
- #dict_par1 Byte offset in a dictionary table entry of the
- the first ("flags") parameter byte
- #dict_par2 And the second ("verb") byte
- #dict_par3 And the third ("adjective") byte
- (note that these three depend only on the version
- number)
-
- #largest_object The largest object number constructed, plus 256
- (the "+256" is due to a quirk in the implementation
- used by Inform 3 or so; since this constant is not
- a very public feature, I don't mind leaving it in)
-
- #adjectives_table Byte addresses of adjectives table,
- #actions_table actions table,
- #preactions_table "preactions table",
- #classes_table class number to object number table,
- #identifiers_table property ID names table
-
- #code_offset Packed address of Z-code area
- #strings_offset Packed address of static strings area
- --------------------------------------------------------------------------
-
- Two more secret syntaxes were introduced in Inform 6.10. The first changes
- the behaviour of the "Include" directive:
-
- Include "Language__";
-
- (and no other string) includes the current language definition file, whose
- name is an ICL variable.
-
- The second controls which grammar table format is generated: normally GV1,
- but this can be set to GV2 by
-
- Constant Grammar__Version = 2;
-
- The "Grammar__Version" symbol is redefinable; if no such Constant directive
- is made, then it will have the value 1. It needs to be changed to its
- final value before the first Verb, Extend or Fake_Action directive is
- reached.
-
-
- 12.3 The "Zcharacter" directive
- ------------------------
-
- Finally, the "Zcharacter" directive is provided mostly for the benefit
- of language definition files, for configuring Inform games to use a
- non-English alphabet or character set. (See the Inform Translator's
- Manual.) Different forms of "Zcharacter" allow both the Z-machine
- alphabet table and the Unicode translation table to be specified.
-
-
- (i) The Z-machine alphabet
-
- The Z-machine's text encryption system is optimised to make it especially
- cheap on memory to use letters in alphabet 1, then cheapish to use letters
- in alphabets 2 and 3 but rather expensive to use letters which aren't in
- any of the three. We aren't much concerned about lack of memory in the
- game as a whole, but the economy is very useful in dictionary words,
- because dictionary words are only stored to a "resolution" of nine
- Z-characters. Thus, in a dictionary word:
-
- something from alphabet 1 costs 1 Z-character
- 2 or 3 2 Z-characters
- outside the alphabets costs 4 Z-characters
-
- The standard arrangement of these alphabets (A1 lower case a to z, A2 upper
- case A to Z, A3 numerals and punctuation marks) includes no accented
- characters. In a language with frequent accented or non-English letters,
- such as Finnish or French, this means that 4 of the 9 Z-characters in
- a dictionary word may be wasted on just one letter. For instance,
-
- 't@'el@'ecarte' is stored as 't@'el'
- 't@'el@'ephone' is stored as 't@'el'
-
- (there are not even enough of the 9 Z-characters left to encode the second
- e-acute, let alone the "c" or the "p" which would distinguish the two words).
- On the other hand if e-acute could be moved into Alphabet 3, say in place of
- one of the punctuation marks which is never needed in dictionary words,
- the two e-acutes would take just 2 Z-characters each and then
-
- 't@'el@'ecarte' would be stored as 't@'el@'ecar'
- 't@'el@'ephone' would be stored as 't@'el@'epho'
-
- which is far more acceptable.
-
- The Z-machine has a mechanism (at least in Version 5 or better) for changing
- the standard alphabet tables, invented to make the German translation of
- Infocom's "Zork I" work. The "Trace dictionary" will print the current
- contents of the alphabet table (as well as the dictionary contents).
-
-
- (i).1 Moving a single character in
-
- There are two ways to change the standard English alphabet table.
- One way, which is probably good enough for a language definition file
- for a mostly Latin language (where only up to around 10 accented or
- non-English letters are commonly needed) is to move characters into
- the least-used positions in Alphabet 2. For this, use the directive:
-
- Zcharacter <char>;
-
- It will only be possible if there's a letter in A2 which hasn't yet
- been used (otherwise, changing that entry in A2 will make some of the
- text already compiled wrong). The directive is thus only practicable
- early in compilation, such as at the start of the library definition file.
-
- For instance the code
-
- Trace dictionary;
- Zcharacter '@'e';
- Zcharacter '@`a';
- Zcharacter '@^a';
- Trace dictionary;
-
- might produce output including...
-
- Z-machine alphabet entries:
- a b c (d) e (f) g (h) i j k l m n o (p)(q) r s t u v (w)(x)(y)(z)
- A (B) C (D) E (F)(G) H I (J)(K) L (M)(N) O (P)(Q) R S (T)(U)(V)(W)(X)(Y)(Z)
- ( ) ^ 0 1 (2) 3 (4)(5) 6 (7)(8) 9 (.) , (!)(?)(_)(#)(')(~) / (\) - (:)(()())
-
- Z-machine alphabet entries:
- a b c (d) e (f) g (h) i j k l m n o (p)(q) r s t u v (w)(x)(y)(z)
- A (B) C (D) E (F)(G) H I (J)(K) L (M)(N) O (P)(Q) R S (T)(U)(V)(W)(X)(Y)(Z)
- ( ) ^ 0 1 @'e 3 @`a@^a 6 (7)(8) 9 (.) , (!)(?)(_)(#)(')(~) / (\) - (:)(()())
-
- ...in which note that bracketed letters are ones which have not been encoded
- yet. The three Zcharacter directives have inserted e-acute, a-grave and
- a-circumflex into the positions previously occupied by the numerals 2, 4 and
- 5. It is reasonable to make up to about 10 such insertions, after which any
- further attempts will only be successful if the game being compiled doesn't
- (let us say) have a title like "123456789: An Interactive Lesson In Counting",
- which would have used 9 of the numerals and forced them to stay in the final
- alphabet table.
-
-
- (i).2 Changing the entire alphabet
-
- This has to be done very early in compilation, before any strings are
- translated, so that it can't be done by a language definition file.
- One might put such directives into a file called "Alphabet.inf" and
- then begin the main game with
-
- Include "Alphabet";
-
- to achieve this.
-
- The form required is to give three strings after "Zcharacter", containing
- 26, 26 and 23 characters respectively. For instance:
-
- Zcharacter "abcdefghijklmnopqrstuvwxyz"
- "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
- "0123456789!$&*():;.,<>@{386}";
-
- (Note that "@{386}" specifies only one character: Unicode $0386.) Space,
- new-line and quotation marks " are automatically included, while
- ~, @ and ^ have special meanings in Inform and should not be used.
- Otherwise, any arrangement of characters is fine, except that every
- character used has to be either a normal ASCII character or part of
- the "extra characters" (already declared) in the ZSCII set.
-
-
- (ii) Defining the "extra characters"
-
- Inform normally makes up a block of "extra characters" based on the
- source code it reads: if it reads plain ASCII or ISO Latin1 (-C0 or
- -C1) then the block contains the usual European accents, such as
- e-acute or i-circumflex, as defined in Z-machine Standard 0.2. (And
- if this table is never changed, Inform doesn't then compile the
- table at all, as this is the default Z-machine arrangement.)
-
- More generally if Inform reads ISO 8859-n (-Cn) then the block is set
- up to contain all the non-ASCII letter characters in ISO 8859-n.
-
- There's room to spare for others to be added, and
-
- Zcharacter table + '@{386}';
-
- would add Unicode character $0386 (Greek capital Alpha with tonos
- accent, as it happens) to the current stock of "extra characters".
-
- Alternatively, you can simply give a fresh stock altogether:
-
- Zcharacter table '@{9a}' '@{386}' '@^a';
-
- would specify a stock of just three, for instance.
-
- These directives must be made before the characters in question are
- first used in game text.
-
-
- 12.4 Sequence points
- ---------------
-
- Inform marks certain positions in the code it compiles as being
- "sequence points". The idea is that the code can be regarded as a
- sequence of chunks, and the sequence points mark where these chunks
- begin. Roughly speaking, each different statement in Inform source code
- compiles to a different chunk, so that statements correspond closely to
- sequence points. Sequence points are marked in assembly trace output
- using the notation "<*>". For instance, the source code
-
- [ WorkOutSquares counter;
- counter = 0;
- while (counter < 100)
- { squares-->counter = counter*counter;
- counter = counter + 1;
- }
- ];
-
- produces the traced output:
-
- 6 +00008 [ WorkOutSquares counter
-
- 7 +00009 <*> store counter short_0
- 8 +0000c .L0
- 8 +0000c <*> jl counter short_100 to L1 if FALSE
- 9 +00011 <*> mul counter counter -> sp
- 9 +00015 storew long_480 counter sp
- 10 +0001b <*> add counter short_1 -> counter
- 11 +0001f jump L0
- 11 +00022 .L1
- 12 +00022 rtrue
-
- The "<*>" in front of an instruction means "the position where this
- instruction begins is a sequence point". We could mark the four
- positions in the original source code as:
-
- [ WorkOutSquares counter;
- <*> counter = 0;
- <*> while (counter < 100)
- { <*> squares-->counter = counter*counter;
- <*> counter = counter + 1;
- }
- ];
-
- Note that the open and close braces and square brackets don't normally
- cause sequence points. The exact rule is that every statement,
- action < > command, assignment or expression in void context is at
- a sequence point, except as shown in the examples below:
-
- for (<*> i=0: <*> i<45: <*> i++) ...
-
- "for" loops contain 0 to 3 sequence points, depending on whether there's
- any code compiled in the three parts of the specification. For instance
-
- for (::) <*> print "Madness!";
-
- contains no sequence point corresponding to the "for" specification.
-
- <*> objectloop (<*> O ofclass Coin) <*> print (name) O;
-
- "objectloop" normally generates two sequence points: at the start,
- where the variable is initialised, and then where it's tested. However,
- loops over the contents of particular objects work differently:
-
- <*> objectloop (O in Mailbox) <*> print (name) O;
-
- (Because the test "O in Mailbox" is not actually being performed at
- run-time: instead, O is looping through the tree.)
-
- do <*> print counter++, " "; <*> until (counter < 17);
-
- Here the sequence point generated by the loop itself is attached to
- the "until" clause, not the "do" clause, because that's where the
- test is performed.
-
-
- "switch", "while" and "if" statements are not exceptions to the
- usual rule (1 statement = 1 sequence point at the beginning), but it
- might be useful to give some examples anyway:
-
- <*> switch(counter)
- { 1: <*> print "One^";
- 2, 3: <*> print "Two or three^";
- default: <*> print "Neither^";
- }
-
- <*> if (i == 17) <*> print "i is 17"; else <*> print "i isn't 17";
-
- <*> while (i<100) <*> print i++;
-
-
- The following is true:
- Except possibly in code explicitly assembled using the "@"
- notation, at each sequence point the Z-machine stack
- is empty and no important information is held in the
- global variables reserved by Inform as "registers":
- thus, it's safe for a debugger to switch execution from
- any sequence point in a routine to any other.
- No two sequence points can be at the same position in either
- the source code or the compiled code.
- Every sequence point corresponds to a definite position in the
- source code (because the veneer, i.e. the code compiled
- from within Inform itself, contains no sequence points).
-
- But the following is _not_ true:
- Sequence points occur in the same order in the source code
- as they do in compiled code
- Every routine contains at least one sequence point
-
- Inform uses sequence points only to generate debugging information
- files for Infix, and to annotate assembly tracing output. They do not
- affect the code compiled.
-
-
- 12.5 Format of debugging information files (for Infix)
- -------------------------------------------------
-
- This is a provisional specification of a format which will probably
- change slightly in future releases. Support for the old -k option has
- been re-introduced in Inform 6.12 to assist development of Infix, the
- projected source-level debugger for Inform. (See the minor utility
- program "infact", updated to 6.12 format, which prints out the contents
- of a debugging information file in legible form.)
-
- A debugging information file begins with a six-byte header:
-
- 0,1 the bytes $DE and then $BF (DEBF = "Debugging File")
- 2,3 a word giving the version number of the format used (currently 0)
- 4,5 a word giving the current Inform version number, in its
- traditional decimal form: e.g. 1612 means "6.12"
-
- The remainder of the file consists of a sequence of records, terminated
- by an end-of-file record. These records may be in _any_ order unless
- otherwise noted. Each record begins with an identifying byte, for which
- constants looking like *_DBR are defined in Inform's source code.
-
- A "string" is a null-terminated string of ASCII chars.
- A "word" is a 16-bit unsigned number, high byte first.
- A "line" is a sequence of four bytes: the first is the file number,
- the next two are a line number (a word), and the last is a character
- number within that line.
- In all three cases -- file numbers, line numbers, character numbers --
- counting begins at 1. The line reference 0:0:0 is however used to
- mean "no such line": for instance, the metaclass "Routine" is defined
- at line 0:0:0, because it's defined by the compiler, not in any
- source code.
- Character positions greater than 255 in any line are recorded simply
- as 255.
- An "address" is a 24-bit unsigned number, a sequence of three bytes
- (high byte, middle byte, low byte). All addresses are counted in
- bytes (rather than being Z-machine packed addresses).
-
- EOF_DBR (byte: 0)
- End of the debugging file.
-
- FILE_DBR (byte: 1)
- <file number> 1 byte, counting from 1
- <include name> string
- <actual file name> string
- One of these records always appears before any reference to the
- source code file in question.
-
- CLASS_DBR (byte: 2)
- <name> string
- <defn start> line
- <defn end> line
-
- OBJECT_DBR (byte: 3)
- <number> word
- <name> string
- <defn start> line
- <defn end> line
-
- GLOBAL_DBR (byte: 4)
- <number> byte
- <name> string
-
- ARRAY_DBR (byte: 12)
- <byte address> word
- <name> string
- The byte address is an offset within the "array space" area, which
- always begins with the 480 bytes storing the values of the global
- variables.
-
- ATTR_DBR (byte: 5)
- <number> word
- <name> string
-
- PROP_DBR (byte: 6)
- <number> word
- <name> string
-
- FAKE_ACTION_DBR (byte: 7)
- <number> word
- <name> string
- Note that the numbering of fake actions differs in Grammar Versions
- 1 and 2.
-
- ACTION_DBR (byte: 8)
- <number> word
- <name> string
-
- HEADER_DBR (byte: 9)
- <the header> 64 bytes
- This is provided in order to check that a debugging information file
- (probably) does match a given story file.
-
- ROUTINE_DBR (byte: 11)
- <routine number> word
- <defn start> line
- <PC start> address
- <name> string
- then for each local variable:
- <local name> string
- terminated by a zero byte.
- Note that the PC start address is in bytes, relative to the start of
- the story file's code area. Routines are numbered upward from 0,
- and in each case the ROUTINE_DBR, LINEREF_DBR and ROUTINE_END_DBR
- records occur in order.
-
- LINEREF_DBR (byte: 10)
- <routine number> word
- <number of sequence points> word
- and then, for each sequence point:
- <source code position> line
- <PC offset> word
- The PC offset for each sequence point is in bytes, from the start of the
- routine. (Note that the initial byte of the routine, giving the number
- of local variables for that routine, is at PC offset 0: thus the actual
- code begins at PC offset 1.) It is possible for a routine to have no
- sequence points (as in the veneer, or in the case of code reading simply
- "[; ];").
-
- ROUTINE_END_DBR (byte: 14)
- <routine number> word
- <defn end> line
- <next PC value> address
-
- MAP_DBR (byte: 13)
- A sequence of records consisting of:
- <name of structure> string
- <location> address
- terminated by a zero byte.
- The current names of structures consist of:
- "abbreviations table"
- "header extension"
- "alphabets table"
- "Unicode table"
- "property defaults"
- "object tree"
- "common properties"
- "class numbers"
- "individual properties"
- "global variables"
- "array space"
- "grammar table"
- "actions table"
- "parsing routines"
- "adjectives table"
- "dictionary"
- "code area"
- "strings area"
- Other names made be added later, and some of the above won't be
- present in all files ("Unicode table", for instance). Locations are
- byte addresses inside the story file.
-
- LINEREF_DBR records will probably be compressed in future releases.
-
-
- 13 What little I remember
- ----------------------
-
- 13.1 Publication history
- -------------------
-
- The language has existed in the following different forms on the Internet:
-
- zass (never publically released)
- Inform 1 v0.5 April 30th 1993
- Inform 2 v0.6 (I can't find any record of this date)
- Inform 3 v1.0 November 17th 1993
- Inform 3a December 7th 1993
- Inform 4 January 20th 1994
- Inform 5 June 1st 1994
- Inform 5.1 June 12th 1994
- Inform 5.2 June 19th 1994
- Inform 5.3 (released on an "Acorn User" cover disc)
- Inform 5.4 October 2nd 1994
- update later in October 1994
- Inform 5.5 v1501 late May/early June (private circulation)
- v1502 June 30th 1995
- Inform 6 v6.01 April 30th 1996: declared a "beta" release
- v6.02 May 5th 1996
- v6.03 May 10th 1996
- v6.04 September 12th 1996: declared no longer "beta"
- v6.05 September 24th 1996
- Inform 6.1 v6.10 December 16th 1996
- v6.11 January 27th 1997
- v6.12 March 26th 1997
- v6.13 April 5th 1997
- v6.14 September 8th 1997
-
- There have been a few other very slightly modified states, consisting of
- bug fixes and source code portability improvements on the above (e.g.
- Inform 4 went through a fairly long process of tidying-up once it reached
- the porters); and a number of CD ROM releases. But these are insignificant
- from a language design point of view and were unnoticed by "the public",
- i.e., by Inform's base of users.
-
- This base of users has continuously grown, from size 1 (myself) to what at
- time of writing is a very large group of at least casual users and a group of
- perhaps 30 "serious" ones whom I actually know. Inform's early versions were
- very hard to use (as will become clear), and as a language it only attracted
- attention because:
-
- (a) it did something undeniably "cool", as aficionados of 1980s adventure
- games mostly agree - it revived the Infocom run-time format,
- and the early Inform manuals were for a while the fullest documents
- on the Web describing the Infocom format; and
- (b) since "Curses" had been written in it, the compiler must at least work.
- (Even though the source code for "Curses" has never been made public,
- the fact that it was known to exist underwrote the project and made
- it seem worthwhile to the many porters who invested considerable time
- and effort on initially complex and poorly written code.)
-
- I first posted Inform to the Internet in the spirit of "glasnost" rather than
- with any pretension that it would be a widely used program. (It worked for
- me, which was all that was then required.) By mid-1994 my ambitions for it
- had grown: in looking back, that first year is the interesting one.
-
-
- 13.2 Design history
- --------------
-
- Inform was designed specifically as a language to express programs which
- would run on the Z-machine, a run-time format which already existed. In
- turn, the Z-machine was designed specifically as a run-time format for
- an existing language (ZIL, or Zork Implementation Language, a form of MDL).
- Inform is therefore the result of a game of Chinese whispers, since I had no
- idea what ZIL looked like until much later.
-
- In fact Inform and ZIL are entirely dissimilar. One reason for this is that
- ZIL is syntactically a very high-level language, not unlike LISP, and its
- compiler (ZILCH) made a substantial translation down to Z-machine level:
- whereas Inform was designed very close to the Z-machine level. It is often
- remarked that Inform resembles C, and although this is partly because some
- syntax conventions were borrowed from C by a designer who felt comfortable
- with it, another reason is that both languages were designed for easy
- translation to efficient low-level code.
-
- Inform began as an assembler called "zass", whose syntax continues to exert
- a (perhaps malign) influence on the language of today. The zass assembly
- mnemonics were different from the modern set (this was at a time when no
- standardisation of opcode names had taken place), and were copied from the
- set of names output by Mark Howell's disassembler "txd". (The opcode
- names recognised by Inform were tinkered with until Inform 5 when the standard
- set was agreed.) But the format of routines and labels remains. A typical
- "zass" routine looked like this:
-
- [ Routine v1 v2 ... vn
- an assembly line
- .alabel
- another assembly line
- ]
-
- and semicolons were introduced as line separators when "zass" became Inform
- 1. Individual assembly lines then took the form
-
- opcode operand1 ... operandn store;
-
- (if they were store operands), with no syntactic indication that the last
- operand was not really an operand but a store destination. This persisted
- until Inform 5, and is still recognised in Inform 6, though the preferred
- syntax is now
-
- opcode operand1 ... operandn -> store;
-
- Right up until Inform 5, similarly, "sp" was a reserved word meaning "the
- assembly-language stack pointer", preventing its use as a local variable
- name (for example). Only in Inform 6 has it disappeared from the set of
- reserved words (except in the context of assembly lines).
-
- One of the most unusual features of Inform is its handling of local variables
- to functions, in that
-
- (a) functions can be called with any number of arguments;
- (b) there's no specification by the programmer of any expected number
- of arguments;
- (c) local variables and argument-receiving variables are treated as
- the same;
- (d) there are at most 15 local variables per routine.
-
- All four arise from the fact that this is how Z-machine programs behave:
- these were the rules for zass, and they're still the rules for Inform 6.
- It's efficient to identify Inform's idea of local variables with the
- Z-machine implementation of locals; nevertheless, the decision to do so
- was made by default.
-
- This illustrates a general point about restrictions in the Inform syntax.
- The Z-machine is rich, well-equipped with features making it a good target
- for compilation. It implements, let us say, feature F subject to a modest
- restriction (for example, it implements local variables subject to a limit
- of 15 per routine). Since this is good enough for most purposes, and
- efficient, and easy to compile to, an early release of Inform decided to
- make use of the Z-machine's implementation directly; and consequently the
- design of the Inform language was shaped around the restriction needed
- to make it possible to use F.
-
- In other words, if a sparer "RISC" target machine had been chosen, Inform
- would have been obliged to make a new implementation of feature F, which
- would not have been subject to arbitrary Z-machine restrictions.
-
- Perversely, it is therefore Inform's greatest weakness (as well as its
- greatest selling point) that it compiles to the Z-machine. At least, though,
- there is a not inconsiderable reward from using the Z-machine's features
- so directly: compact and fast code results.
-
- In any case, the history of Inform is a continuous process of syntax moving
- away from direct expression of the Z-machine and towards a higher-level
- description of a program.
-
-
- Something that partially frustrated this was that Inform acquired a serious
- user base "too soon", before the language design was mature enough: by Inform
- 4, a good many people were using Inform, and this began to make me nervous of
- changing the established syntax. But if a syntax feature is clearly "wrong"
- (that is, anomalous and inconvenient) then clearly one must bite the bullet
- and make a change. In retrospect, I feel I have always been too hesitant
- over this.
-
- For example, I realised when working on Inform 5.1 that the notation
-
- print "You swallow ", (the) food, " with glee.";
-
- to mean "print the name of the object whose number follows, preceding it
- by a suitable definite article", would be desirable. But I had wanted
- Inform 5 to be a point of stability, not rapid change, and so I shied away
- from adding this feature (besides, I was worried about complicating the
- interaction between the library and the language itself). I did not in fact
- add the feature until Inform 5.5, but it would have been easier for all
- concerned had I done so earlier.
-
- The moral to draw would be that any really big changes still needed ought
- to be made now. Unfortunately, they are just too big (the syntax for
- character and dictionary literals, for example, or for array and property
- value initialisation) and so I have failed to learn my lesson, after all.
-
-
- Another problem with an evolutionary design is that vestigial features tend
- to survive, especially if one is trying to serve a user base who may still
- be using old syntaxes. (For instance, using old library files of code.)
- A problem P, say, is given a crude solution S1 in 1993, a slightly better
- solution S2 in 1994 and a proper solution S3 in 1995. By this time, does
- Inform still have to recognise the syntaxes applying to S2 and S1? Usually
- I have decided to make it do so; the formal concept of an "obsolete usage"
- (which would be recognised but cause "please change this" warnings to be
- issued) was not added to Inform until 5.5.
-
- For example, how do we set a property value? In Inform 1, by using a line
- of assembly code:
-
- put_prop lamp oil_left 15;
-
- By Inform 3 this had become painful, and a shift away from assembly language
- was taking place, but not a drastic one. The statement "write" was devised
- for writing a sequence of property values (in this example, just one):
-
- write lamp oil_left 15;
-
- (Even this existed partly because of a horrid form of "object recycling"
- present in Autumn 1993 forms of "Curses" and very early drafts of "Jigsaw",
- causing objects to be heavily over-written: it became necessary to change
- many properties in one go. The "write" statement was a great help.)
-
- Only in Inform 4 did the C-like form appear:
-
- lamp.oil_left = 15;
-
- And not until Inform 6 was it possible to apply ++ and -- to a property
- specified in this way. Likewise, only in Inform 6 was the "write" statement
- finally withdrawn from recognition, though it was flagged as obsolete in
- Inform 5.5.
-
- Such obsolete syntaxes do not merely increase the "clutteredness" of the
- language; they also block more desirable syntaxes by causing clashes. And
- the tendency to provide partial solutions to problems has been unfortunate
- in putting off the point where significant change is needed. I did not
- realise in "zass" that a notation was needed to refer to dictionary words
- as operands of opcodes, automatically creating them if need be; then I did
- realise this, in Inform 1, but made a directive called "Dictionary" to set
- constant symbol names equal to their dictionary addresses; in Inform 3,
- I found this was cumbersome and instead used the syntax #n$... to indicate
- "new dictionary word" (it seemed logical since #w$... was being used for
- "existing dictionary word"); in Inform 5, the syntax '...' was used, and
- this was perhaps a mistake, since it clashes slightly with the syntax for
- specifying literal characters.
-
- If I had realised at Inform 1 that a syntax was needed, this would have been
- avoided. As a footnote, #n$... is still present in the language; #w$...
- and "Dictionary" were finally deleted in Inform 6.
-
-
- A happier point of view on this evolution is that, at every stage, the subset
- of Z-machine features which Inform could use expanded. Inform 1 could produce
- only version 3 games; Inform 4 was a major breakthrough in making version 5
- games; and so on. (V5 did not become the default version compiled to
- until Inform 5, but this reflects a lack of confidence in the interpreters
- for high-version-number Z-machines then available.) And an evolved syntax
- is at least adapted to what its users want of it.
-
-
- (i) Inform 1 and 2
- --------------
-
- Inform 1, then, was the assembler "zass" rewritten and with certain shorthands
- added. For example,
-
- var = number operator number;
-
- was permitted, though full expressions were not. Most of the "creating"
- directives were present -- Object, Global, Property and Attribute -- though
- directives were not syntactically distinguished from statements, and could
- be mixed freely. Nor were statements distinguished from assembly lines:
- indeed, most of them were assembly lines.
-
- In addition, control structures were provided:
-
- if ... else ...
- while ... ...
- do ... until ...
-
- These took their modern form, except that conditions did not need to be
- bracketed, and braces were compulsory around all code blocks, even those
- which contained only one statement. A form of "for" loop was provided,
- but took the BASIC-like form "for var 1 to 10". (This syntax existed
- alongside the modern "for" syntax until Inform 6.)
-
- Grammar and actions existed in their modern form from the first (except
- for the <...>, <<...>> notation for causing them). Object definitions took
- their modern form (except that specifying a parent was compulsory, there was
- no "class" segment and routines could not be given as property values).
-
- And that was about all. The language appears from this description to be
- very primitive, and it was, but it was not so different in appearance from a
- "real" language (except for the lack of compound expressions) and "Curses"
- was written in Inform 1 code for about two years before being translated
- into modern syntax.
-
- Inform 2 was only a tidying-up release of Inform 1.
-
-
- (ii) Inform 3 and 3a
- ---------------
-
- Inform 3 was a more substantial upgrade, with various worthy features
- (command line switches; control over header flag bits) added. But the
- driving force was that "Curses" was being expanded, and running severely into
- the limits of the Z-machine Version 3 architecture: in particular, the limit
- of 255 objects. "Curses" apparently needed a minimum of about 270: Richard
- Tucker suggested to me that there was no need for the 255 physical Z-machine
- objects to correspond directly to the 270 game objects. So the idea of
- "recycling" was born. Object O might be a carpet in one room and a door in
- another: as long as it was impossible for these ever to be simultaneously
- present. When either room was entered, everything about O was over-written
- to make it the appropriate definition.
-
- This made it necessary to be able to drastically alter objects. The "write"
- statement (see above) was introduced to fix properties, a syntax #n$...
- was introduced to refer to dictionary words (because so many were needed as
- property values to over-write with) and the @.. string escape was added
- to make use of the Z-machine's abbreviations table, with the "string"
- statement added for changing entries in it. (This last made it possible for
- the apparently unalterable object short-name to consist of a string reading
- "print abbreviation 1", so that by changing abbreviation 1 it was possible to
- effectively alter short names.) Finally, object declaration syntax was
- altered to make it possible to give the same object more than one symbol
- name.
-
- In retrospect this substantial effort to break the game-object Z-object
- correspondence was misguided; the proper solution to the problem was to make
- Inform Version-5 capable (as happened in Inform 4). It ought to be
- remembered, though, that at this time many people were using Version 3
- interpreters not capable of running Version 5 games (indeed, the Version 5
- release of "Curses" was one of the nudges towards a wider awareness of
- Version 5 and the interpreters capable of running it). Inform hung back
- because interpreters were not ready; yet as events turned out, interpreters
- became ready partly because of Inform overtaking them.
-
- Inform 3a was a tidying-up release of Inform 3.
-
-
- (iii) Inform 4
- --------
-
- Inform truly became a high-level language, rather than an assembler with
- delusions of grandeur, only with the release of Inform 4.
-
- The important new addition was a full expression evaluator, recognising
- operators (and an expanded set of these, including for instance ++ and --),
- precedence levels, brackets and so on. Conditions were also permitted to
- be compound, using the new logical operators && and ||. Properties could
- be referred to using the new "." operator (along with ".&" and ".#").
- System functions (such as "children" and "random") were added.
-
- This was the decisive break of the correspondence between Inform statements
- and Z-machine instructions: expressions no longer looked like three-address
- code by any other name, and might compile to arbitrarily long sequences of
- instructions. Opcodes like "random" were no longer accessed directly, but
- via system functions.
-
- The most annoying syntax point remaining was that braces were still
- compulsory. A typical Inform 3 line might have been:
-
- if i == 2 { print "Two"; }
-
- This restriction was now lifted. Implementation considerations
- (specifically, in the source line-reader which fed into the tokeniser) meant
- that some indication was needed of where the condition ended, and so brackets
- were imposed on conditions. Thus, the corresponding Inform 4 line:
-
- if (i == 2) print "Two";
-
- Making the parallel with C even closer, new-style "for" loops were
- introduced:
-
- for (i=0:i<10:i++) print i;
-
- (and the desire to make Inform capable of compiling "for" loops as complex as
- those available to C caused the operators ++, -- and , to be added). More
- originally, "objectloop" was added.
-
- Although it had little impact on the syntax, to the outside world the major
- development was that Inform was now capable of compiling Version 5 files
- (which had no limit on object numbers and could be twice as large as Version
- 3 ones). The main impact of this on the language (which I now regret) was
- that statements like "box" and "style" were added to control newly available
- text features of the Z-machine.
-
- Otherwise I took the decision that in principle all Inform programs should be
- cross-compilable to all versions (subject to the limits applying in each
- version), and this remains substantially true. Since certain game operations
- such as saving, loading and reading the keyboard were commanded by very
- different assembly language in versions 3 and 5, these were abstracted into
- statements so that, for instance, what looked like version-3 assembly
- language to read the keyboard would in fact be compiled to appropriate but
- different version-5 assembly.
-
-
- (iv) Inform 5
- --------
-
- The final stage in Inform's arrival as a programming language took place a
- little over a year after Inform 1's creation. Only then did I look back in
- leisure at the language, and only then did I begin to seriously find out about
- the rival adventure-game languages to see what they were capable of. I was
- unimpressed except by TADS (which I continue to have a great respect for):
- the point was rubbed home when I was adapting the original mainframe
- Adventure, "Colossal Cave", to Inform for use as an example. Dave Baggett's
- TADS translation was constantly at my side, and it became obvious that the
- corresponding Inform 4 code was far less legible.
-
- In some ways, then, the Inform 5 syntax was worked out by constructing the
- language so as to make the implementation of "Advent" look neat.
-
- Although Inform had, from the first, associated code closely with objects
- (the concept of before/after routines was part of the original conception),
- only now was this brought out in syntax by making explicit routine definitions
- property values. (Previous code had been full of property values like
- "FireplacePre", which were names of what would now be called before-routines.)
- The syntax now seems obvious and natural, but at the time it felt like a
- breakthrough: what had been broken through, I think, was the residual feeling
- that Inform was an assembler at heart, whose programs were a list of routines
- interspersed with some overgrown variable definitions.
-
- Class definitions (and the inheritance rules) followed in the same, somewhat
- bewildering, couple of days, and a brace of aesthetic changes followed:
- dictionary words written 'thus', ##Names for actions (rather than the
- previous #a$Name syntax), the use of bare strings as implied "print_ret"
- statements. Most importantly, the use of <...> and <<...>> to trigger
- actions, and the implementation of what amounted to a switch statement on the
- current action within embedded routines.
-
- I have come to realise that adventure game programs are unusually dense in
- selections and comparison against lists of alternative possibilities. Inform
- now contains many syntaxes to make such comparisons concise:
-
- switching on actions in embedded routines;
- the "switch" statement (added in Inform 5.5) - note that both of these
- deliberately take a more concise form from, say, C, by not allowing
- case fall-through and therefore not requiring the incessant use
- of "break" statements;
- switch ranges such as "5, 7, 8 to 19: ...";
- the "or" alternative operator in conditions, as in "if (x == 1 or 5) ...".
-
- And this, I feel, is something for which I should make no apology.
-
- Later developments (in 5.5) included the Array directive (previously, it was
- customary to use global variables to point to arrays) and the printing
- syntaxes for "print (The) ..." and so forth (see above). An extension was
- made to "Versions 7 and 8" (new hybrids of the Z-machine allowing larger
- games to be compiled), though this had little impact on the language or the
- compiler.
-
- The release of Inform 5.5 notwithstanding, the language remained essentially
- stable during the two years between Inform 5 and Inform 6. For the first
- time it became possible for people other than myself to seriously use Inform,
- rather than toy with it. From Inform 5 onwards, the issue has not been
- adequacy of the language but adequacy of its documentation, and I think this
- is the point where Inform can be counted a "proper" language.
-
-
- 13.3 Implementation history
- ----------------------
-
- Writing a compiler for a large language from scratch requires glacial
- patience, but the result is a mountain. The time invested in tedious details
- during development is repaid ten times over when the compiler comes to be
- used. Temporary measures are infernal temptations, and must be rejected
- firmly.
-
- Inform was not written this way.
-
- The compiler has been through three "design iterations": "zass", which was
- discarded as soon as I thought I had understood how the Z-machine worked,
- Inform 1 to 5.5, and now Inform 6. This section briefly describes the second
- iteration.
-
- Inform 1 was essentially a program for translating assembly language lines
- into Z-code statements. It made the generous and unjustified assumption
- that its input would always be well-formed except for the odd spelling
- mistake, and did not check syntax particularly carefully: lexical analysis
- consisted of grabbing a line from the source code files (accessing these
- files a byte at a time, in an unbuffered way) and cutting out white space
- to leave a set of tokens. Syntax analysis consisted of going through
- these tokens until a line seemed to have worked out all right, and then
- forgetting the line altogether (never checking if there were any more tokens).
- This caused an enormous number of little jobs later, testing for error
- conditions like "expected a semicolon after ... but found ..." in a quite
- unsystematic way.
-
- Not only was there no formal apparatus for syntax analysis, but lexical
- analysis went on all over the place, with strings being cut up and compared
- against the symbols table erratically (and often repeatedly). Tokens were
- stored as strings (with all the consequent inefficiencies of manipulation).
- Direct string comparisons were commonplace, making it difficult to pin down
- what the set of reserved words was, and indeed almost impossible to write
- a precise definition of what programs would pass Inform syntax checking.
-
- When statements required translation into code, this translation would
- actually be made into textual assembly language instructions, fed back into
- Inform via a "preprocessor". (As Inform became more elaborate, this
- preprocessing might take three stages: high level code such as "objectloop"
- constructs being knocked down to equivalent "while" statements and then to
- assembly language lines.)
-
- In order to cope with the problem of forward references to labels, Inform
- made two passes through its assembly language source, which is fairly standard
- behaviour for assemblers (because it is easy to implement and has low
- memory overheads). Yet it became a more and more nightmarish process to
- ensure that high-level constructs would always translate into identical code
- on pass 1 and pass 2, and ever more clear that great effort was being made
- to do a long and difficult calculation twice, burning the result the first
- time.
-
- Given all this, the wonder is that Inform worked at all. In fact it worked
- rather well, after a couple of years of polishing and optimisation: although
- slow in theory, profiling runs suggested that the process of translating into
- textual assembly language (for instance) was not a significant speed loss
- (since opcodes could be textually recognised very quickly, while the work
- in parsing operands had been deferred from earlier on, and would have to have
- been done anyway). Inform performed very rapidly in terms of compilation
- time per source line when compared to typical commercial compilers (which,
- to be fair, were compiling more complex languages) and was unusually low
- in memory consumption.
-
- This last was a serious issue when Inform was released, as many users were
- running it on machines like 0.5M Amigas. Without a grown-up syntax analyser,
- there was no need to store large parse trees (such as most compilers do for
- entire routines at a time: the performance of most compilers seriously
- degrades as the length of a single routine increases).
-
- From Inform 1 to 5.5, the main algorithms and issues never changed, despite
- much reorganisation, renaming, rewriting and extension. But after two and a
- half years, it had clearly "rusted" badly: it was difficult to maintain or
- work on. Bugs were appearing which seemed ridiculous in a program so
- established and widely used: for example, not until 1995 did anybody realise
- that division had been accidentally programmed as right, not left associative
- (thus "12/6/2" evaluated to 4, not 1). An endless mass of ad-hoc rules was
- being added to cover up the lack of proper lexical analysis. Moreover, the
- major new feature -- the linker -- added another massive complication to an
- already convoluted piece of code. (I have by now designed and painfully got
- working three different versions of the linker, and would not go through all
- that again for all the tea in China.)
-
- What began as a general tidying-up and bug-fixing exercise sank into the
- depressing realisation that a total rewrite was the only way to make
- significant progress. Six months later, the writing of this sentence finally
- completes that task.
-
-
- 13.4 Modification history
- --------------------
-
- (i) Changes made between v6.01 and v6.02
- ------------------------------------
-
- Features added:
- "Declared but not used" and "no such constant" errors now reported on or near
- line of occurrence, not at end of source.
- Error message added for local variable being defined twice for the same
- routine
-
- Bugs fixed:
- "Segmentation fault" (i.e., writing memory out of range) fixed to do with
- symbol flags being set wrongly
- Constant binary numbers not lexed correctly
- "for (p = 0::)" misinterpreted due to "::" being recognised as an operator
- Backpatch error under RISC OS (and possibly elsewhere) when linker used
- Grammar token routines reported as "declared but not used"
- Grammar token routines not working at run time
- "ofclass" failing to recognised inherited class membership
- Inheritance of negated attributes not working
- "children" function giving the wrong answer (specifically, always giving the
- object number of the first child); this had the knock-on effect of
- making the "tree" debugging verb produce incorrect output
-
- Source code cleaning:
- Header file rearranged to typedef int32 in the right place, and tidied
- up a little; ICL_Directory defined rather than left to cause an error
- when Inform is compiled; INFORM_FILE replaced with MAIN_INFORM_FILE
- in some of the OS definition blocks; SEEK_SET defined rather than left
- to the vagaries of ANSI
- Type clashes between int and int32 reconciled for:
- assemble_label_no, routine_starts_line, read_byte_from_memory_block,
- write_byte_to_memory_block, parse_label, symbol_index
- Return value for hash_code_from_string cast explicitly to int (rather than
- implicitly from unsigned int)
- prec_table given type int
- Many "dead assignments" (redundant settings of variables) removed
- "=-" replaced by "= -" to prevent pre-ANSI compilers from misunderstanding
- "x=-1" as "x=x-1"
- The veneer string for the source of "CA__Pr" has been contracted to make
- it smaller than 2048 chars long, since Microsoft Visual C/C++ won't
- compile strings longer than that
- symbs[] explicitly cast to char * in a few points in "linker", "objects" and
- "symbols"
- Format string for process IDs corrected from "_proc%08x" to "_proc%08lx"
- Format string for serial number copying removed, and a use of strcpy put
- in its place
-
- (ii) Changes made between v6.02 and v6.03
- ------------------------------------
-
- Feature added:
- The on/off command-line switches can now be prefixed with ~ to turn them off.
- (This is useful only in ICL, to undo the effect of previous lines.)
-
- Bugs fixed:
- The "my parent is Class" value in modules changed from 0xffff to 0x7fff to
- make it valid in type int (rather than int32)
- The "spaces" statement not being parsed correctly (bug in syntax analyser)
- Arrays declared using Global rather than Array don't work (this is fixed,
- but also now causes an obsolete usage warning)
- Fclose applied to null file handle (effectively, the same file closed twice)
- "for (::p++)" misinterpreted due to "::" being recognised as an operator
- Some -> -> object initialisations getting the structure wrong
- Table of property names wrongly compiled (resulting in garbled run-time error
- messages)
- Serious failure of individual property inheritance from a class when both
- inheritor and class define individual property values
- Messages sent to a property whose value is NULL now do nothing and reply 0
- (as if the property value were 0)
- Action switches not working in properties called via messages (well, not
- really a bug: but it's better that they should work, as this makes it
- possible to call "before"/"after" as messages)
- The "children" inlined routine leaving redundant values on the stack (which
- resulted in complex expressions containing uses of children()
- mis-evaluating)
- Actions in the form <A x> and <A x y> using a call opcode improperly (with
- possibly regrettable results -- though they worked on my interpreter!)
- ICL files not working (due to a bug gratuitously introduced in v6.02), and
- not understanding tab characters as spaces, and generally being poorly
- implemented (I found print "Err..."; in the code, to my alarm): generally
- tidied up now
- Negative constants not working as switch() case values
- Slightly better reporting of bad statement errors, and slightly better
- recovery
-
- Source code cleaning:
- Calls to get_next_char() replaced with (*get_next_char)() which is equivalent
- to a modern ANSI compiler, but inequivalent to a pre-ANSI one
- Logically redundant "break" statements added to parse_routine() to prevent
- either "unreachable code" or "no return from non-void function" warnings
- being produced
- parse_given_directive() has its unnecessary parameter removed
- uchar types used for characters as read in by the lexer before filtering takes
- place to, e.g., strip off any set top bits
-
- (iii) Changes made between v6.03 and v6.04
- ------------------------------------
-
- Features added:
- When double-quoted text is continued from one source line to another, and
- the first ends in ^, then the new-line is not replaced by a space " ".
- Thus: print "Shall I compare thee to a summer's day?^
- Thou art more...";
- results in the "T" being printed underneath the "S", not one space to the
- right.
- Statements like
-
- "The wyvern swallows ", (the) noun, " in three gulps!";
-
- are now correctly understood to be print_ret statements. (Previously
- this gave errors because the syntax analyser thought it was going to
- be a list of three switch cases separated by commas, until it hit the
- semicolon and realised there should be a colon, by which time it was
- too late to go back. The language definition has never been clear
- on whether long implied print_rets are legal or not, so I've decided
- that they are.)
- The #n$ construct (for single-character dictionary words) can now take
- non-alphabetic characters: e.g., #n$1 and #n$# refer to dictionary
- words '1' and '#'.
-
- Bugs fixed:
- The Verb "newverb" = "oldverb"; syntax not working, due to mistake in syntax
- analyser (or rather, to inadequate testing)
- If routine R is being Replaced, then its new definition can now appear either
- before or after its Library definition (as long as the Replace declaration
- occurs before both definitions). In 6.01 to 6.03, errors would result
- if the new definition was earlier than the Library one.
- Constants not being allowed as switch cases when they happen to be the same as
- internal statement keywords (such as the single letter "a")
- Memory allocations of 0 bytes now return NULL, which protects Inform from
- trying to apply "free" to them, which seems to have caused problems on
- some ports (not on mine, but it was my mistake)
- Spurious "Second 'Ifnot' in same 'If...'" errors appearing in certain
- nested 'If...' configurations
- A comma in between object headers and bodies is now optional once again (as
- it used to be in Inform 5), rather than forbidden.
- Text or commentary appearing after a string folding character \ now causes
- an error, as it should (the rest of such a line should be empty).
- "continue" not properly working inside a "for" loop of the "optimised" kind
- (that is, whose update code consists of just variable++ or variable--)
- For two different reasons, "or" did not quite work when used in its
- extended way (i.e. with conditions other than ==, ~= or with many
- alternatives): apologies, as the code was in a muddle. Better now.
- Class classname(N) ... was allowing creation of only N-1 new instances,
- not N.
- (a) not recognised as a print specification when "a" is also the name of
- a local variable.
- "youngest" function failing to work.
-
- Source code cleaning:
- Header declarations of the variables "mv_xref" and "no_stubbed_routines"
- removed (neither one actually exists in the released state of Inform 6)
- Miscellaneous comments added
- Assembly of store opcodes optimised to pull opcodes in a few cases, saving
- 1 byte of object code each time
- Hooks for Robert Pelak's MAC_FACE port (traditionally the most strenuous)
- inserted
- Error messages reworded slightly for better consistency
-
- (iv) Changes made between v6.04 and v6.05
- ------------------------------------
-
- Feature added:
- Assembly language tracing slightly tidied up (a cosmetic change only).
-
- Bugs fixed:
- When a file with very long strings in (such as one generated automatically
- by "infoclues") is read, it can corrupt memory, resulting in the malloc
- heap being damaged.
- Spurious backpatching errors (actually, all backpatching errors are spurious -
- backpatching is supposed to work every time) produced when the Z-code
- area of the Z-machine exceeds 64K. (This seldom happens even in large
- games, unless quite long print and print_ret strings are quoted.) The
- error message in question was
-
- *** Backpatch symbol out of range ***
-
- and, just to recap, this should never appear. Please send me the
- offending source code if it persists!
- The only time I've seen this bug is that it once hung the Z-machine while
- working on printing an inventory of items which needed grouping using
- "list_together", but it's actually a mistake in the shift-reduce
- expression grammar: "(...)" (function call) has precedence level 11,
- whereas "." has level 12, in order to make messages
- object.message(...)
- parse correctly (i.e., as (object.message)(...)). This isn't usually
- done in SR operator grammars because, as I now realise, it results in
- function(X).property
- being misinterpreted as function(X.property). I've corrected this by
- giving (...) the asymmetric precedence level of 11 on the right, but
- 14 on the left.
- Printing dictionary contents to a text transcript file not working on the
- Amiga port, since an internal buffer was overwritten by one byte
- (5x16+1 = 81, not 80).
- Spurious "variable declared but not used" warnings appearing for a variable
- used only as a store destination in assembly language.
- The "box" statement producing boxes of the wrong width when the text contains
- the @ escape character (e.g. "@@92@@92@@92" used to be wrongly
- calculated as 12 characters wide, when it really consists of three
- backslash characters).
- The v6.04 optimisation using "pull" (see above) didn't work in v6, v7 or v8
- (for two different reasons, but basically because the opcode behaves
- differently in these as compared with lower versions).
- Assembly language in v8 recognising the wrong instruction set (same as
- the previous bug).
- Version 6 games not properly compiled in some cases (because of a rounding
- error in calculating packed address offsets: a new section 8.8 has been
- added to this manual to document how Inform works these out).
-
- "I compiled Inform 6.05 under Purify, and am pleased to report that Purify
- reported exactly 0 memory access errors and 0 bytes of memory leaked."
- -- Brad Jones
-
-
- (v) Changes made between v6.05 and v6.10
- ------------------------------------
-
- Features added:
- The four ICL path variables which define places to look for input of
- different kinds (source code, include files, ICL files and modules)
- can now hold lists of alternative locations, separated by a character
- called FN_ALT which may be OS-dependant (but is normally ',').
- These alternatives are tried left-to-right until the desired filename
- is found. It's legal for an alternative to be empty, meaning the
- current position (e.g. "+include_path=,library,oldlib" gives three
- possible paths -- the current directory, then the library and oldlib
- subdirectories). The on-line examples (in the -h1 help information)
- now include this new feature.
- File and pathnames in ICL and on the command line can now contain spaces
- if written inside double-quotes: e.g., "Games Folder/New Games" would
- be a valid ICL pathname. (Provided for benefit of Macintosh users.)
- A new error message format, -E2, uses Macintosh Programmer's Workshop style.
- Output game files in the MPW port are given the type and creator for the
- MaxZip interpreter. (With thanks to Tom Emerson for supplying details.)
- The compiler can now generate a new format of grammar table: if used with
- old code, it will behave just as it used to, but if used with Library
- 6/3 or later will generate "GV2" (grammar version 2) tables. Users
- will notice that several restrictions are lifted, most usefully:
-
- the number of tokens per grammar line can be up to 30, not up to 6;
- the total number of prepositions is now unlimited (it used to be
- limited to about 80);
- the total number of routines named in grammar tokens is now
- unlimited (it used to be limited to 32);
- the total number of actions per game can be up to 4096, not 256.
-
- (Chapter 8 above has been rewritten to document GV1 and GV2 in full.)
- In addition, there are three new grammar features:
- (a) you can mark an action as "reverse", as in the following example:
-
- Verb 'show' 'present' 'display'
- * creature held -> Show reverse
- * held 'to' creature -> Show;
-
- "reverse" indicating that the two parameters are to be reversed
- in order before the action is generated.
- (b) you can give two or more prepositions (only) as alternatives:
-
- Verb 'sit' 'lie'
- * 'on' 'top' 'of' noun -> Enter
- * 'on'/'in'/'inside' noun -> Enter;
-
- the "/" markers indicating alternative prepositions, any one of
- which is equally good.
- (c) there is a new grammar token called "topic", which matches any
- block of text up to the next preposition or the end of the input.
- For instance,
-
- Verb 'read'
- * noun -> Examine
- * 'about' topic 'in' noun -> Consult
- * topic 'in' noun -> Consult;
-
- It's used mostly for reading material and subjects of conversation,
- hence the name "topic". For how to deal with the results of parsing
- a topic, see the Designer's Manual on "Consult".
- These three features are only available when using Inform in conjunction
- with Library 6/3 or later.
- The run-time veneer's implementation has been rewritten, especially in the
- areas of message-sending and superclass access. Several benefits:
- (a) the :: superclass operator can be used with any property
- (in Inform 6.05 it only worked with individual properties);
- (b) it makes sense to look up or send to
-
- something.Object::cant_go
-
- and this refers to the default value of "cant_go". (So we can
- think of common properties as being exactly the ones inherited
- from class Object.)
- (c) printing out a property name, as in
-
- print (property) Crown::colour
-
- now correctly prints "Crown::colour". (Previously it got stuck
- on superclass properties.)
- (d) the limits on numbers of things are made more sensible. These
- are now as follows:
- (i) up to 256 classes per game;
- (ii) up to 128 different individual properties (and up to 63
- common properties) can be provided by any single object;
- (iii) up to 32703 individual property names throughout the game.
- These limits are now enforced by the compiler, which used to
- lazily not check them.
- (e) in games using the Inform library, when the variable debug_flag
- has bottom bit set (the Inform library does this when the
- "routines" verb has been typed), all messages are traced to
- the screen. (Except any messages caused as a result of trying
- to perform this printing.) This is fuller and more helpful
- than the old "routines" output.
- (Several parts of chapter 9 above have been corrected.)
- The dictionary data structure is entirely redesigned, as a red-black tree
- rather than a hash table. The effects of this change are localised
- to "text.c" (i.e., no other source files were modified to achieve it).
- For small games no difference will be noticed; for large games there will
- be a slight speed improvement, becoming more noticeable as the game grows.
- (E.g.: on my machine "Curses" compiles 4% faster.) See Section 8.4 above
- for details.
- As part of a general move toward multilingualism, the notation for dictionary
- words is extended: 'word//letters' means the word 'word' with some flags
- attached. At present, the only such flag is 'p', meaning "plural".
- Note that 'word//' is legal and equivalent to 'word'. Thus, 'w//'
- is another way of making the dictionary word #n$w (which we can't write
- as 'w' because that's only a single character).
- (This information is stored in a previously unused flag in #dict_par1:
- see the bitmap in section 8.5 above.)
- Dictionary words may now contain accented characters (this is likely to be
- essential for some Scandinavian languages). It is also now legal to
- write a quotation mark within quotation marks, in two cases:
- ''' (meaning: the literal character ')
- '@'etude' (to put an acute accent on something)
- To avoid dictionary resolution falling unacceptably when accented chars
- are used, it's helpful to move commonly occurring ones into the
- Z-machine alphabets: the new directive "Zcharacter" does this.
- See section 12.2 above.
- The dictionary contents are now given in alphabetical order in text transcript
- files, and the "Trace dictionary;" output is much more descriptive.
- Tracing output for calls to functions marked with a * is more legible,
- giving exactly the arguments supplied and naming embedded routines
- more sensibly (e.g. as "lantern.time_left" rather than "Embedded__43").
- The -g switch now traces only the user's own functions, not the library
- ones, unless -g2 is set.
- Inform now checks for the error of giving the same property twice in one
- definition. This is not always as obvious as (say) giving the
- "description" of an object twice over, because of the existence of
- aliases. A typical error message would be:
-
- line 15: Error: Property given twice in the same declaration,
- because the names 'initial' and 'when_open' actually refer
- to the same property
-
- A warning is produced if the "box" statement is used in a Version 3 game
- (since it cannot have any effect in such a game).
- The default memory allocation for arrays has been increased (from 4000 to
- 10000 bytes). The original was too conservative, and anyway more will
- be needed for language definition files in library 6/3.
- Action and attribute names are now written into story files, so that it's
- possible to print them (for debugging purposes). Full details of how
- to do this are in section 9.6 above. (Users with library 6/3 may notice
- that the "actions" debugging verb now names all actions, not just the
- ones defined by the library.)
- Inform now allows the three special constants DEBUG, USE_MODULES and
- MODULE_MODE to be defined more than once without giving an error.
- Thus, if your code contains "Constant DEBUG;" already, but you
- compile it with the "-D" option anyway, no error will be produced.
- When an object isn't given a textual name, rather than calling it "?"
- Inform now calls it something like "(red_box)", where "red_box" is
- its internal symbol name. (This makes debugging output easier to
- follow and wastes very few bytes.) When the object hasn't got an
- internal name either (if it's defined just as "Object;" for
- example) it has the textual name "(102)", 102 being its object number.
-
- Bugs fixed:
- "Extend only ..." not always parsed correctly, sometimes giving a spurious
- syntax error, other times failing to notice "first/last/replace".
- In -U mode, failure to report as an error a line consisting of two words, the
- first of which is a constant or variable name defined in a previously
- linked module. (This arose when the line "Action RevTakeWith" was
- accidentally written instead of "Fake_action RevTakeWith" -- what
- happened was that "Action" was recognised as the variable "action",
- which Inform knew was a symbol exported from a linked module (ParserM);
- it took this value as a class, wrongly, and made a new object called
- RevTakeWith of this "class". The fault was in the syntax analyser,
- which wrongly assumed that it wouldn't necessarily be known whether an
- exported symbol was a class-name or not, and so allowed all exported
- symbols to be used, just to be on the safe side.)
- ICL fatal error messages quoting a garbled filename.
- Version 3 games (ah, remember them?) not compiling because of the usage of
- call opcodes not present in the V3 Z-machine
- The "Include ">name"" feature not working when the original source file
- lies at the current working directory of the host OS, rather than in
- some subdirectory.
- The linker had a rarely-occurring bug which prevented the tasks scoring
- system from working properly in a game which linked rather than included
- the library modules: if an instruction wrote a value to certain variables
- whose internal numbers were in a certain narrow range, then these
- variables would not be correctly reconciled with the story file variables
- at link time. This same bug possibly also accounts for an even rarer
- occurrence, which I've had no definite report of, in which the less than
- professional error message "Ooops" is produced. (I've corrected the
- error message to something more helpful.)
- Rather seriously (though this error took a long time to be found),
- if (condition) rfalse; else ... [or the same thing with rtrue;]
- producing syntax errors. (A mistake in some optimisation code.)
- The "elder" function wrongly returning 0.
- Not always reporting errors when global variables were used wrongly as
- values of object properties or array elements. (Sometimes a backpatch
- error would be given.) These are now properly parsed in constant
- context, not quantity context.
- Spurious warning about code not being reachable at the close brace of
- a "for" loop, when this is indeed not reachable. (It can still be
- sensible code, if the idea is always to "continue" or "return"
- from the body of the loop.)
- When the error "Error: System function name used as value "child"" is
- produced, also printing a spurious *** Emit token error *** message.
- Backpatch errors sometimes produced when compiling a line (in a module only)
- which contains an "if" condition making heavy use of "or".
- Code backpatch errors sometimes produced when the condition "ofclass" is
- compiled.
- '->' or 'Nearby' mysteriously failing if the parent object referred to
- was (a) defined in a module and (b) one of the first four objects
- defined there.
- "Declared but not used" warnings issued for replacements of routines
- only accessed from within a linked module.
- Error message if a non-existent ICL variable set, not being printed.
-
- Source code cleaning:
- A port definition called PC_WIN32 has been added to "header.h", courtesy of
- Kirk Klobe.
- Two variables made int32 instead of int (thanks to Evan Day for spotting
- the desirability of this).
- The veneer routines are segmented so that no individual literal string
- is as long as 512 characters. (An irritating requirement imposed by
- the Macintosh port's compiler.)
- Six miscellaneous explicit casts of (char *) to (uchar *), suggested by
- Robert Pelak. More systematically, the opcode names in asm.c are all
- explicitly cast to (uchar *).
- A slight change to inform.c makes the Mac front end's arrangement (of
- calling only sub_main(), while main() itself is not compiled) more
- generally usable by other ports: see section 2.2 (e).
-
-
- (vi) Changes made between v6.10 and v6.11
- ------------------------------------
-
- Bugs fixed:
- An important one -- the optimised form of statements like
- if (variable) rfalse;
- was being negated, i.e., was being wrongly compiled as
- if (~~variable) rfalse;
- (In Library 6/3, this showed up at one stage as odd behaviour when
- moving around in darkness.)
- The statement "spaces 0" printing infinitely many spaces instead of
- doing nothing. (Negative arguments have always done nothing.) It is
- possible to provoke this behaviour in calls to the list-writer.
- Bug to do with parsing quoted filenames in ICL.
- Spurious "declared but not used" warning for global variables used only
- in a module being linked in.
-
- Source code cleaning:
- Atari ST definition block updated on advice of Charles Briscoe-Smith.
- Copyright dates nudged on to 1997.
- Paranoid test added when closing temporary files.
- Macintosh interface code added on advice of Robert Pelak.
- Two long printfs subdivided to assist poor compilers.
- String subdivision in "CA__Pr" in the veneer moved slightly in a way
- which makes no logical difference, but which seems to fix a mysterious
- problem observed by Robert Pelak (possibly with his compiler: it has
- been observed on no other platform).
- Text transcript file is now opened as a text file, not a binary file,
- which will hopefully assist some ports but inconvenience nobody.
- (All the same, it may be worth testing that text transcripts still
- look sensible on your port.) Code added to the Macintosh port to
- ensure closure of the file if compilation aborts.
-
-
- (vii) Changes made between v6.11 and v6.12
- ------------------------------------
-
- Feature added:
- A new switch, -C, and the ability to read source code files in any
- ISO 8859 standard extension to the ASCII character set. The
- default is ISO 8859-1, Latin1, the standard character set on
- most computers in most European countries. This means that
- many accented characters can simply be typed directly, a
- particular advantage with exotic ISO ranges (Arabic, Hebrew,
- Greek, Cyrillic, etc.). If your computer is unable to use any
- of 8859-1 to -9, the -C0 switch (for plain ASCII only) can be
- used. Accent markers such as '@:u' (for u-diaeresis) remain
- valid, just as usual, but a new string escape '@{..hex number..}'
- allows the specification of any Unicode character. For instance
- '@{a9}' produces a copyright sign (Unicode values between $0000
- and $00ff are equal to ISO Latin1 values), and '@{2657}' produces
- in principle a White bishop chess symbol. In practice, if you
- wish to use rare Unicode symbols, you'll need an interpreter
- obeying the Z-Machine Standard 1.0, and you'll probably have
- to supply it with an appropriate font as well; also, any characters
- not normally resident in the Z-machine need to be declared in
- advance of use with the "Zcharacter table" directive. But if
- you're only using (e.g.) ordinary German, Spanish or French
- accents the new facilities will work with any Standard 0.2
- interpreter.
- The "Zcharacter" directive, for configuring the Z-machine to
- use non-English alphabets, is greatly enhanced.
- (In the Z-machine, Inform now always generates a header extension
- table, and sometimes uses a slot within it to point to a Unicode
- translation table newly specified in Standard 1.0.)
- The debugging information file has finally been brought up to date with
- Inform 6, and old code inherited from Inform 5 (the only such
- code in the whole compiler) has been cleaned out. The main change
- is in producing source-code-to-PC-value maps, which is made more
- difficult by the compression of code at the end of each routine.
- The exact format will probably change: this release is an interim
- one, to assist development of Infix.
- The concept of sequence points has been introduced, to assist Infix.
-
- Bugs fixed:
- Not a bug as such, but the veneer implementation has been changed so that
- class-objects never have attributes set. In 6.01 to 6.11, a class
- object like Monster might have the attribute "animate" and would
- therefore be picked up in objectloops over all animate objects, and
- so on. This is undesirable, and illogical since after all class
- objects don't provide any properties, so why have attributes?
- Anyway, the result means that objectloops of the form
-
- objectloop (x has <attribute>) ...
-
- will now not range through classes but only genuine game objects.
- When compound Boolean conditions are used as expressions (for instance,
- in an assignment like "x = (y>1) || d;") the wrong answers could
- result if the top operator was ||. (If it was &&, or if the
- condition had no &&s or ||s in, or if the condition was being used
- as a test rather than a numerical value, everything was normal.)
- Fake actions being defined before grammar version is set were causing
- mysterious problems. This would happen if the Fake_Action directive
- were used before "Parser" is included. An error message has been
- added advising that Fake_Action directives should be moved.
- (Fake actions are now mostly obsolete anyway.)
- Any attributes aliased together being given the unhelpful name
- "<unknown attribute>" in output from the "showobj" debugging verb.
- Backpatch *** errors *** sometimes being produced as a knock-on effect
- from already-reported linkage errors. (These *** errors *** are
- now suppressed if linkage has already broken down.)
- The character codes for << and >> (Continental European quotation marks)
- were the wrong way around, following a mistake in version 0.2 of
- the Z-Machine Standards document. They have therefore been
- swapped over, following the correction made in version 1.0 of that
- document.
- Accented characters deep inside dictionary words sometimes getting lost.
- (Because of an ambiguity in the Standards document where Inform
- made the wrong choice, but interpreters made the right one.)
- The -w (suppress warnings) switch doing nothing. (In writing Inform 6
- I simply forgot this.)
-
- Source code cleaning:
- A new UNIX64 port, for 64-bit Unix machines, added (courtesy of Magnus
- Olsson) which should subtract pointers correctly even in the unlikely
- event that they lie in different 2^32 byte pages. Otherwise identical
- to UNIX.
- And a new BEOS port added (courtesy of Michael van Biesbrouck) but which is
- identical to the Linux one in all but name.
- Link error message handling, and character error message handling,
- slightly more automated.
- The "Compiled with %d errors and %d warnings" message reworded so as not
- to mention 0 of anything, and to indicate how many warnings were
- suppressed (by the -q or -w switches).
- The table of memory settings is now printed more prettily in response
- to the ICL command $list.
- A new memory setting, MAX_LABELS, has been added, along with error checking
- to test for too many label points in any single routine.
- Various functions to do with character set handling tidied up and moved
- into the new section "chars.c".
-
-
- (viii) Changes made between v6.12 and v6.13
- ------------------------------------
- Bug fixed:
- String escapes in the form "You can't go @00 from here." not working
- (slip of the keyboard when re-writing string escapes in v6.12).
-
- Source code cleaning:
- The type of alphabet[][] has been altered so as to ensure that the
- contents are not read-only strings:
- from char *alphabet[3] to char alphabet[3][27].
-
-
- (ix) Changes made between v6.13 and v6.14
- ------------------------------------
-
- Bugs fixed:
- "Parker's disease": In very large games only, backpatch errors being
- caused because the Z-code area of the Z-machine had overflowed
- past the size where the compiler could backpatch it (this tended
- to happen only if large amounts of text were printed by Inform code
- rather than being property values). The Inform code generator now
- automatically writes any string of text longer than 32 characters
- into the static strings area: designers shouldn't notice any difference
- in behaviour of the "print" or "print_ret" statements.
- (The new arrangement is fractionally wasteful of bytes -- about
- 2 to 4 bytes per string in excess of 32 characters -- but makes
- a big difference to the relative sizes of the code and strings
- areas. "Advent" compiled the old way was 60.1% code, 17.3% strings;
- the new way, it comes out 40.6% code, 37.1% strings and is about
- 1K bigger. Most of the change occurs when compiling library
- messages; the effects are less dramatic on typical Inform code.)
- A different cause of backpatch errors is attempting to link Version 5 modules
- into a Version 8 game. If you're compiling a V8 game and want to
- link in the library, for instance, you'll need to compile the
- library modules with "-v8" as well. Inform used not to check this
- error condition, but now does produce an error message if there's
- a version mismatch between game and module.
- The linker also now checks that module and game are using the same
- Z-machine character set. (If either one has made a Zcharacter directive
- which differs from the other, then the sets are inconsistent and
- linkage can't take place. This will only ever happen if you're
- working in a language other than English, and trying to use the linker,
- and it can be cured by copying the Zcharacter directives out of the
- language definition file into a little include file, and including
- that in the source for your main file as well as the source for any
- modules of your own -- i.e., any modules other than the library ones.)
- Finally, the linker checks for the error condition that the module asked
- to import a variable which turns out not to have been defined in the
- main game. (This may cause problems in linking with library 6/6, as
- Inform is now able to detect an error in 6/6's "linklv.h" which it
- previously wasn't able to detect: delete the line reading
- "Import global gender_and_number;" from "linklv.h" if so.)
- "continue" not working (branching to the non-existent label -1, in
- fact) if used inside a "switch" statement. (It ought to continue
- the innermost loop containing the "switch" statement, if there is
- one, or give an error if there isn't.)
- The two string escapes @@94 (producing a ^ character) and @@126
- (producing ~) in fact producing new-line and double-quote.
- Conditional branches to rtrue or rfalse in assembly-language being
- reversed in sense.
- Source code being garbled after any Include statement which ends at
- a byte position which is a multiple of 4096, minus 3. (My thanks
- to Andrew Plotkin for both finding and diagnosing this.)
- "For" loops in the form "for ( : : )" (with white space between the two
- colons), which occur in a routine after a previous one which happens
- to have a designer-defined label in a particular position, causing
- an incorrect if harmless "Label declared but not used" warning.
- Very short programs testing "x provides <property>", but never reading,
- writing or otherwise using <property>, causing stack overflows or
- restarts. Nobody would ever need such a program anyway.
- Function calls not working, or having garbled arguments, in Version 3
- or Version 4 games (the Inform library requires Version 5 or better
- nowadays, so the ability to compile V3 or V4 files is only needed
- for interpreter testing).
- The error message disallowing use of an assembly opcode if the Z-machine
- version doesn't support it, now quotes the opcode name.
- Suppressed "declared but not used" warnings for global objects not being
- counted towards the total number reported as suppressed when Inform
- prints its final message.
-
- Source code cleaning:
- Three character conversions amended to unsigned character conversions.
- Backpatch errors, which I devoutly hope not to see again, are now
- reported as a new category of error: "compiler errors" which produce
- an apologetic banner, asking the user to report the fault to me.
- Link errors are reported more prettily (differently, anyway).
-
-
- (x) Acknowledgements
- ----------------
-
- Many thanks to the following informers, who have reported bugs in Inform 6:
-
- Torbj|rn Andersson, Jose Luis Diaz de Arriba, Paul E. Bell,
- Michael van Biesbrouck, Gerald Bostock, Neil Brown, Russ Bryan,
- Jesse Burneko, Evan Day, Stephen van Egmond, Tom Emerson, Greg Falcon,
- Roy Fellows, Rob Fisher, David Fletcher, C. E. Forman, Richard Gault,
- Paul Gilbert, Michael Graham, Bjorn Gustavsson, Andreas Hoppler,
- Sam Hulick, Brad Jones, Christoffer Karlsson, Niclas Karlsson,
- John Kennedy, Matt Kimball, Kirk Klobe, Michael A. Krehan, Mary Kuhner,
- Josh Larios, Harvey Lodder, Bonni Mierzejewska, Tim Muddleton,
- Olav Mueller, Ross Nicol, Magnus Olsson, Marnie Parker, Robert Pelak,
- Aphoristic Petrofsky, Andrew Plotkin, Fredrik Ramsberg, Matthew Russotto,
- Miron Schmidt, Rene Schnoor, Nyuchezuu Shampoo, Anson Turner,
- Uncle Bob Newell and all.
-
- ------------------------------------------------------------------------------
-